EVOLUTIONARY COMPUTING CMT563




         Antonia J. Jones




                                6 November 2005
Antonia J. Jones: 6 November 2005




UNIVERSITY OF WALES, CARDIFF
DEPARTMENT OF COMPUTER SCIENCE (COMSC)


         COURSE:                     M.Sc. CMT563
         MODULE:                     Evolutionary Computing
         LECTURER:                   Antonia J. Jones, COMSC
         DATED:                      Originally 15 January 1997
         LAST REVISED:               6 November 2005
         ACCESS:                     Lecturer (extn 5490, room N2.15).

Overhead slides are posted on:

                                    http://users.cs.cf.ac.uk:81/Antonia.J.Jones/

electronically as pdf Acrobat files. It is not normally necessary for students attending the course to print this file,
as complete sets of printed slides will be issued.

©2001 Antonia J. Jones. Permission is hereby granted to any web surfer for downloading, printing and use of this
material for personal study only. Copyright permission is explicitly withheld for modification, re-circulation or
publication by any other means, or commercial exploitation in any manner whatsoever, of this file or the material
therein.

Bibliography:

MAIN RECOMMENDATIONS

The recommended text for the course is:

    [Hertz 1991] J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the Theory of Neural Computation.
    Addison-Wesley, 1991. ISBN 0-201-51560-1 (pbk).


A cheaper alternative is:

Yoh-Han Pao, Adaptive pattern recognition and neural networks. Addison-Wesley, 1989. ISBN 0-201-12584-6.
Price (UK) £31.45.

A useful addition for the Mathematica labs is:

    Simulating Neural Networks with Mathematica. James A. Freeman. Addison-Wesley. 1994. ISBN 0-201-56629-X.


These books cover most of the course, except the theory of genetic algorithms. The first is the recommended
book for the course because it has excellent mathematical analyses of many of the models we shall discuss. The
second includes some interesting material on the application of Bayesian statistics and fuzzy logic to adaptive
pattern recognition. It is clearly written and the emphasis is on computing rather than physiological models.




The principal sources of inspiration for work in neural and evolutionary computation are:

         ! E. R. Kandel, J. H. Schwartz, and T. M. Jessel. Principles of Neural Science (Third Edition),
         Prentice-Hall Inc., 1991. ISBN 0-8385-8068-8.

         ! J. D. Watson, Nancy H. Hopkins, J. W. Roberts, Joan A. Steitz, and A. M. Weiner. Molecular
         Biology of the Gene, Benjamin/Cummings Publishing Company Inc., 1988. ISBN 0-8053-9614-4.

When you see how big they are you will understand why! It is a sobering thought that most of the knowledge in
these tomes has been obtained in the last 20 years.

Although extensive references are provided with the course notes (these are also a useful source of information for
projects in Neural Computing), a definitive bibliography for computing aspects of the subject is:

The 1989 Neuro-Computing Bibliography. Ed. Casimir C. Klimasauskas, MIT Press / Bradford Books. 1989. ISBN
0-262-11134-9.

Finally, the key papers up to 1988 can be found together in:

Neurocomputing: Foundations of Research. Ed. James A. Anderson and Edward Rosenfeld, MIT Press 1988.
ISBN 0-262-01097-6.

NETS - OTHER (HISTORICALLY) INTERESTING MATERIAL

Perceptrons, Marvin Minsky and Seymour Papert, MIT Press 1972. ISBN 0-262-63022-2 (was reprinted recently).

Neural Assemblies, G. Palm, Springer-Verlag, 1982.

Self-Organisation and Associative Memory, T. Kohonen, Springer-Verlag, 1984.

Parallel Models of Associative Memory, G. E. Hinton and J. A. Anderson, Lawrence Erlbaum, 1981.

Connectionist Models and Their Applications, Special Issue of Cognitive Science 9, 1985.

Artificial Neural Systems, Special Issue of IEEE Computer, March 1988.

Neural Computing Architectures, Ed. I. Aleksander, Kogan Page, December, 1988.

Parallel Distributed Processing. Vol. I: Foundations. Vol. II: Psychological and Biological Models. David E.
Rumelhart et al., MIT Press / Bradford Books. 1986. ISBN 0-262-18123-1 (Set).

Explorations in Parallel Distributed Processing - A Handbook of Models, Programs, and Exercises. James L.
McClelland and David E. Rumelhart, MIT Press / Bradford Books. 1988. ISBN 0-262-63113-X. (Includes some
very useful software for an IBM PC - there is also a newer version with software for the MAC).

GENERAL

An Introduction to Cybernetics, W. Ross Ashby, John Wiley and Sons, 1964.

A classic text on cybernetics.

Vision: A computational investigation into the human representation and processing of visual information, David
Marr, W. H. Freeman and Company, 1982. ISBN 0-7167-1284-9.

One of the classic works in computational vision.

Artificial Intelligence, F. H. George, Gordon & Breach, 1985.

Useful textbook on AI.

GENETIC ALGORITHMS/ARTIFICIAL LIFE

Artificial Life, Ed. Christopher G. Langton, Addison-Wesley 1989. ISBN 0-201-09356-1 pbk.

A fascinating collection of essays from the first AL workshop at Los Alamos National Laboratory in 1987. The
book covers an enormous range of topics (genetics, self-replication, cellular automata, etc.) on this subject in a very
readable way but with great technical authority. There are innumerable figures, some forty colour plates and even
some simple programs to experiment with. All this leads to a book that is beautifully presented and compulsive
reading for anyone with a modest background in the field.

Synthetic systems that exhibit behaviour characteristic of living systems complement the traditional analysis of
living systems practised by the biological sciences. It is an approach to the study of life that would hardly be
feasible without the advent of the modern computer and may eventually lead to a theory of living systems which
is independent of the physical realisation of the organisms (carbon based, in this neck of the woods).

The primary goal of the first workshop was to collect different models and methodologies from scattered
publications and to present as many of these as possible in a uniform way. The distilled essence of the book is the
theme that Artificial Life involves the realisation of lifelike behaviour on the part of man-made systems consisting
of populations of semi-autonomous entities whose local interactions with one another are governed by a set of
simple rules. Such systems contain no rules for the behaviour of the population at the global level.

Adaptation in Natural and Artificial Systems, John H. Holland, University of Michigan Press, 1975.

The book that started Genetic Algorithms, a classic.

Genetic Algorithms and Simulated Annealing, Ed. Lawrence Davis, Pitman Publishing 1987. ISBN 0-273-08771-1
(UK), 0-934613-44-3 (US).

A collection of interesting papers on GA related subjects.

Genetic Algorithms in Search, Optimization, and Machine Learning, David E. Goldberg, Addison-Wesley, 1989.
ISBN 0-201-15767-5.

The first real text book on GAs.






                                                                     CONTENTS


I What is evolutionary computing? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
         Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
         A general framework for neural models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
         Hebbian learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
         The need for machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

II Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     14
         Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   14
         The archetypal GA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .        14
         Design issues - what do you want the algorithm to do? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                          18
                  Rapid convergence to a global optimum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                         19
                  Produce a diverse population of near optimal solutions in different `niches' . . . . . . . . . . .                                        19
         * Results and methods related to the TSP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                   20
         Evolutionary Divide and Conquer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                  21
         Chapter references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       25

III Hopfield networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   29
         Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   29
         Hopfield nets and energy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .           29
         The outer product rule for assigning weights. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                    31
         Networks for combinatoric search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                 32
         Assignment of weights for the TSP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                  33
         * The Hopfield and Tank application to the TSP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                         36
         Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    37
         Chapter references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       37

IV The WISARD model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .           40
       Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     40
       Wisard model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .         41
       WISARD - analysis of response . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                  43
       Comparison of storage requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                     44
       Chapter references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .         45

V Feedforward networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .        46
        Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    46
        Backpropagation - mathematical background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                         46
                 The output layer calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                 47
                 The rule for adjusting weights in hidden layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                            48
        The conventional model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .             48
        Problems with backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                 49
        The gamma test - a new technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                  50
        * Metabackpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .             53
        * Neural networks for adaptive control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                  53
        Chapter references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .        58

* VI The chaotic frontier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     59
        Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    59
        Mathematical background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .               59
        Chaos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   60



            Chaos in biology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      61
            Controlling chaos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       62
            The original OGY control law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .              62
            Chaotic conventional neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                  64
            Controlling chaotic neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                 65
                     Control varying T in a particular layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                    67
                     Using small variations of the inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                   67
            Time delayed feedback and a generic scheme for chaotic neural networks . . . . . . . . . . . . . . . . . . .                                      70
                     Example: Controlling the Hénon neural network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                              71
            Chapter references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      73

COURSEWORK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79


                                                                 LIST OF FIGURES


Figure 1-1 The stylised version of a standard connectionist neuron. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Figure 1-2 Information processing capability [From: Mind Children, Hans Moravec]. . . . . . . . . . . . . . . . . 12
Figure 1-3 Storage capacity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Figure 2-1 Generic model for a genetic algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Figure 2-2 Standard genetic operators. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Figure 2-3 Premature convergence - no sharing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Figure 2-4 Solution to 50 City Problem using Karp's deterministic bisection method 1. . . . . . . . . . . . . . . . 22
Figure 2-5 The EDAC (top) and simple 2-Opt (bottom) time complexity (log scales). . . . . . . . . . . . . . . . . . 23
Figure 2-6 The 200 generation EDAC 5% excess solution for a 5000 city problem. . . . . . . . . . . . . . . . . . . . 24
Figure 2-7 EDACII mid-parent vs offspring correlation for 5000 cities [Valenzuela 1995]. . . . . . . . . . . . . . 25
Figure 2-8 EDACIII mid-parent vs offspring correlation for 5000 cities [Valenzuela 1995]. . . . . . . . . . . . . 25
Figure 3-1 Distance Connections. Each node (i, p) has inhibitory connections to the two adjacent columns whose
        weights reflect the cost of joining the three cities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Figure 3-2 Exclusion connections. Each node (i, p) has inhibitory connections to all units in the same row and
        column. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Figure 4-1 Schematic of a 3-tuple recogniser. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Figure 4-2 Continuous response of discriminators to the input word 'toothache' [From Neural Computing
        Architectures, Ed. I Aleksander]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Figure 4-3 A discriminator for centering on a bar [From Neural Computing, I. Aleksander and H. Morton].
          . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Figure 5-1 Solving the XOR problem with a hidden unit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Figure 5-2 Feedforward network architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Figure 5-3 The previous layer calculation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Figure 5-4 The Water Tank Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Figure 5-5 Architecture for direct inverse neurocontrol. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Figure 5-6 Delta transformation of state differences: maps to the 2-4-2 network inputs for the Water Tank
        Problem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Figure 5-7 Least squares fit to 200 data points with 20 nearest neighbours: Γ = 0.0332. . . . . . . . . . . . . . . . 56
Figure 5-8 Volume variation without adaptive training. 2-4-2 Network. MSE = 0.052. Linear Planner. . . . 57
Figure 5-9 Temperature variation without adaptive training. 2-4-2 Network MSE = 0.052. Linear Planner.
          . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Figure 5-10 The control signals generated by the 2-4-2 network without adaptive training. Linear Planner.
          . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Figure 6-1 Stable attractor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Figure 6-2 A chaotic time series. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Figure 6-3 The butterfly effect. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61


Figure 6-4 Intervals for which the variables are defined. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Figure 6-5 Feedforward network as a dynamical system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Figure 6-6 Chaotic attractor of Wang's neural network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Figure 6-7 The Ikeda strange attractor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Figure 6-8 Attractor for the chaotic 2-10-10-2 neural network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Figure 6-9 Bifurcation diagram x obtained by varying T in the output layer only. . . . . . . . . . . . . . . . . . . . . 68
Figure 6-10 Bifurcation diagram y obtained by varying T in the output layer only. . . . . . . . . . . . . . . . . . . . 68
Figure 6-11 Variations of x from initiation of control. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Figure 6-12 Variations of y from initiation of control. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Figure 6-13 Parameter changes during output layer control. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Figure 6-14 Bifurcation diagram for the output x(t+1) using an external variable added to the input x(t). . . 69
Figure 6-15 Bifurcation diagram for the output y(t+1) using an external variable added to the input x(t). . . 69
Figure 6-16 Variations of x from initiation of control. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Figure 6-17 Variations of y from initiation of control. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Figure 6-18 Parameter changes during input x control. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Figure 6-19 A general scheme for constructing a stimulus-response chaotic recurrent neural network: the chaotic
        "delayed" network is trained on suitable input-output data constructed from a chaotic time series; a
        delayed feedback control is applied to each input line; entry points for external stimulus are suggested,
        with a switch signal to activate the control module during external stimulation; signals on the delay lines
        or output can be observed at the "observation points". . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Figure 6-20 The control signal corresponding to the delayed feedback control shown in Figure 6-21. Note that
        the control signal becomes small. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Figure 6-21 Response signal on x(n-6) with control signal activated on x(n-6) using k = 0.441628, τ = 2 and
        without external stimulation after first 10 transient iterations. After n = 1000 iterations, the control is
        switched off. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Figure 6-22 Response signals on network output x(n), with control signal activated on x(n-6) using k = 0.441628,
        τ = 2 and with constant external stimulation sn added to x(n-6), where sn varies from -1.5 to 1.5 in steps
        of 0.1 at each 500 iterative steps (indicated by the change of Hue of the plot points) after 20 initial
        transient steps. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Figure 6-23 The control signal corresponding to the delayed feedback control shown in Figure 6-22. Note that
        the control signal becomes small even when the network is under changing external stimulation. . 72
Figure 6-24 Response signals on network output x(n), with control setup same as in Figure 6-22 but with
        Gaussian noise r added to external stimulation, i.e. sn+r, with σ = 0.05, at each iteration step. . . . . 73
Figure 6-25 The control signal corresponding to the delayed feedback control shown in Figure 6-24. . . . . 73
Figure 6-26 Response signals on network output x(n), with control experiment setup same as in Figure 6-22 but
        with Gaussian noise r added to external stimulation, i.e. sn+r, with σ = 0.15, at each iteration step.
          . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Figure 6-27 Response signals on network output x(n), with control experiment setup same as in Figure 6-22 but
        with Gaussian noise r added to external stimulation, i.e. sn+r, with σ = 0.3, at each iteration step.
          . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Standard genetic operators. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Schematic of a 3-tuple recogniser. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

                                                            LIST OF ALGORITHMS

Algorithm 2-1 Archetypal genetic algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                     16
Algorithm 3-1 Hopfield network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .               31
Algorithm 5-1 The Gamma test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .               52
Algorithm 5-2 Metabackpropagation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                  53
Algorithm 7-1 Generic GA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .            84
Algorithm 7-2 Generic Hopfield net. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .               86







                                                        I What is evolutionary computing?


         "
         dna tsap naht erutan enon-ro-lla na fo yldigir ssel hcum era hcihw seiroeht ot dael lliw siht fo llA
         .retcarahc ,lacitylana erom hcum dna ,lacirotanibmoc ssel hcum a fo eb lliw yehT .cigol lamrof tneserp
         ll i w cigol lamrof fo metsys wen siht taht eveileb su ekam ot snoitacidni suoremun era ereht ,tcaf nI
         si sihT .cigol htiw tsap eht ni deknil elttil neeb s a h h c i h w e n i l picsid rehtona ot resolc evom
         fo trap taht si ti dna ,nnamztloB morf deviecer saw ti mrof eht ni yliramirp ,scimanydomreht
         g n i r u s a e m d n a g n i t al u p i n a m o t s t c e p s a s t i f o e m o s n i t s e r a e n s e m o c h c i h w s c i s y h p l a c i t e r o e h t
                                                               ]403 .p ,5 .loV skroW detcelloC ,nnamueN nov[ ".noitamrofni



Introduction.

Evolutionary computing embraces models of computation inspired by living Nature. For example, evolution of
species by means of natural selection and the genetic operators of mutation, sexual reproduction and inversion can
be considered as a parallel search process. Perhaps we can tackle hard combinatoric search problems in computer
science by mimicking (in a very stylised form) the natural process of evolutionary search.
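As a toy illustration of that idea, here is a minimal sketch of such a stylised evolutionary search in Python. It is not course software: the one-max fitness function, the roulette-wheel selection, and all parameter values are hypothetical choices made for the example.

```python
import random

def evolve(fitness, n_bits=20, pop_size=30, generations=100, p_mut=0.01):
    """Minimal evolutionary search over bit-string 'chromosomes' using
    selection, one-point crossover (sexual reproduction) and mutation."""
    pop = [[random.randint(0, 1) for _ in range(n_bits)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # Fitness-proportionate (roulette-wheel) selection of parents.
        weights = [fitness(c) for c in pop]
        parents = random.choices(pop, weights=weights, k=pop_size)
        nxt = []
        for i in range(0, pop_size, 2):
            a, b = parents[i], parents[i + 1]
            cut = random.randint(1, n_bits - 1)        # one-point crossover
            for child in (a[:cut] + b[cut:], b[:cut] + a[cut:]):
                nxt.append([g ^ 1 if random.random() < p_mut else g
                            for g in child])           # point mutation
        pop = nxt
    return max(pop, key=fitness)

# Toy "one-max" problem: evolve a string of all 1s.
best = evolve(lambda c: sum(c) + 1)
print(sum(best))
```

Even this crude sketch usually finds the all-ones optimum; the course's archetypal GA (Chapter II) refines each of these ingredients.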

Evolution through natural selection drives the adaptation of whole species, but individual members of a species
can also adapt to a greater or lesser extent. The adaptation of individual behaviour on the basis of experience is
learning and stems from plasticity of the neural structures which convey and process information in animals.

Learning enables us to recognise previously encountered stimuli or experiences and modify our behaviour
accordingly. It facilitates prediction and control of the environment, both essential prerequisites to planning. All
of these are facets of what we loosely call intelligence.

Real-world intelligence is essentially a computational process. This is a contentious assertion known as "the strong
AI position". If it is true then the precise mechanism of computation (the hardware or neural wetware) ought to
be irrelevant to the actual principles of the computational process.

If this is indeed the case then the only obstacles to the construction of a truly intelligent artifact are our own
understanding of the computational processes involved and our technical capability to construct suitable and
sufficiently powerful computational devices.

A general framework for neural models.

Throughout this course we describe a number of neural models, each a variation on the connectionist paradigm
(often called Parallel Distributed Processing - PDP), which in turn is derived from networks of highly stylised
versions of the biological neuron.

It is useful to begin with an analysis of the various components of these models. There are seven major aspects of
a connectionist model:

         !   A set of processing units ui, each producing a scalar output xi(t) (1 ≤ i ≤ n).

         ! A connectivity graph which determines the pattern of connections (links) from each unit to each of the
         other units in the network. We shall often suppose that each unit has n inputs, but there is no particular
         reason why all units should have the same number of inputs.



Although it is often convenient for theoretical discussions to consider fully interconnected networks, for very large
networks of either real or artificial neurons the relevant case is that of relatively sparse connectivity. The
connectivity graph then describes the fine topology of the network. This can be useful in practical applications:
in speech recognition networks, for example, it is often helpful to have several copies of the same sub-net connected
to temporally distinct inputs. These sub-net copies act as a feature detector and so can share their weights; this
effectively reduces the number of parameters needed to describe the full network and speeds up learning. It is
sufficient to be given a list of inputs and outputs for each node, since we can then recover the connectivity graph.

         !  A set of parameters pi1,...,pik, fixed in number, attached to each unit ui, which are adjusted during
         learning. Most commonly k = n and the parameters are weights wij (1 ≤ j ≤ n), where wij is often taken
         to be associated with the link from j to i, or in biological terms associated with the synaptic gap.

         ! An activation function for each unit, neti = neti(x1,...,xn;pi1,...,pik), which combines the inputs to ui into
         a scalar value. In the commonly used model neti = Σj wij xj.

It is important to realise that the basic principle of neural networks is that of simple (but arbitrary) computational
function at each node. Learning when it occurs can be considered as an adjustment of the parameters associated
with a node based on information locally available to the node. ‘Locally’ here means as specified by the
connectivity graph. This information often takes the form of a correlation between the firings of adjacent nodes,
but it could be a more sophisticated calculation. Thus we are really dealing with a very general class of parallel
algorithms. The concentration on the ‘weights associated with links’ model has arisen partly because of the
biological precedent, because of the extreme simplicity of the computational function of a node, and because this
special case has been shown to be of practical interest.



       Figure 1-1 The stylised version of a standard connectionist neuron: the inputs x1(t),...,xn(t) are combined
       by the activation function neti = neti(x1, ... ,xn, pi1, ... , pik), and a sigmoidal output function xi = f(neti)
       transforms this into the output xi(t+1) placed on the output links.

         ! An output function xi = f(neti) which transforms the activation function into an output. In the earliest
         models f was a discontinuous step function. However, this poses analytical difficulties for learning
         algorithms so that often now f is a smooth sigmoidal shaped function. In some models f is allowed to vary



         from one unit to another and so then we write fi for f.

         ! A learning rule whereby the parameters associated with each processing unit are modified by
         experience.

         !   An environment within which the system must operate.

A set of processing units. Figure 1-1 illustrates a standard connectionist component. All of the processing of a
connectionist system is carried out by these units. There is no executive or overseer. There are only relatively
simple units, each doing its own relatively simple job. A unit's job is simply to receive input from other units and,
as a function of the input it receives and the current values of its internal parameters, to compute an output value
xi which it sends to the other units. This output is discrete in some models and continuous in others. When the
output is continuous it is often confined to [0,1] or [-1,1]. The system is inherently parallel in that many units carry
out their computations at the same time.

Within any system we are modelling, it is sometimes useful to characterize three types of units: input, output, and
hidden. The hidden units are those whose inputs and outputs are within the system we are modelling. They are not
‘visible’ to outside systems.

A connectivity graph. Each unit passes its output to other units along links. The graph of links represents the
connectivity of the network.

A set of parameters and an activation function. In the conventional model the parameters for unit i are assumed
to be weights wij associated with the link from unit j to unit i. If wij > 0 the link is said to be an excitatory link, if
wij = 0 unit j is effectively not connected to unit i, and if wij < 0 the link is said to be inhibitory link. In this case
neti is calculated as
                                                                           n
                                                       net i   j '                 wijx j                            (1)
                                                                       j   '   1



This is a linear function of the inputs and so neti is constant over hyperplanes in the n-dimensional space of inputs
to unit i.
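As an illustrative sketch (not part of the course software; the function names here are our own), the linear activation of equation (1) is just a dot product, and is indeed constant over hyperplanes in the input space:

```python
import numpy as np

def net(w, x):
    # Weighted-sum activation of equation (1): net_i = sum_j w_ij x_j,
    # where w is the weight vector of unit i and x the vector of its inputs.
    return float(np.dot(w, x))

# net is constant over each hyperplane w.x = c in the n-dimensional input space:
w = np.array([1.0, -2.0, 0.5])
x1 = np.array([1.0, 1.0, 2.0])       # w.x1 = 1 - 2 + 1 = 0
x2 = np.array([3.0, 2.0, 2.0])       # w.x2 = 3 - 4 + 1 = 0, same hyperplane
print(net(w, x1), net(w, x2))        # both 0.0
```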

In fact, if one is interested in generalising the computational function of a unit, it is often convenient to associate
the parameters (in the conventional case weights) with the unit. In which case one thinks of the links as passing
activation values and one is no longer constrained to have exactly n (the number of inputs) parameters per unit.
For example, one could have a unit which performed its distinction function by determining whether or not the
input vector lay within some ellipsoid. In this case there would be n parameters associated with the centre of the
ellipsoid and another n parameters associated with the axes. (In addition one could provide the ellipsoid with
rotations which would provide further parameters.) Now the activation function would look like
                                               neti = Σj=1..n Aij (xj - cij)^2                                       (2)



This is a simple example of a higher order network in which the function neti is not a linear function of the inputs.
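Such an ellipsoidal unit can be sketched as follows (an illustrative construction of our own; equation (2) as given is the axis-aligned form, with no rotation parameters):

```python
import numpy as np

def ellipsoid_net(x, c, A):
    # Higher-order activation of equation (2): net_i = sum_j A_ij (x_j - c_ij)^2.
    # c holds the centre of the ellipsoid, A the coefficients for its axes.
    return float(np.sum(A * (x - c) ** 2))

c = np.array([0.0, 1.0])   # centre of the ellipsoid
A = np.array([1.0, 4.0])   # axis coefficients
print(ellipsoid_net(np.array([0.0, 1.0]), c, A))  # 0.0 at the centre
print(ellipsoid_net(np.array([1.0, 1.5]), c, A))  # 1 + 4*(0.5)^2 = 2.0
```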

An output function. The simplest possible output function f would be the identity function, i.e. just take xi = neti.
However, in this case with the activation function (1) the unit would be performing a totally linear function on the
inputs and, as it turns out, such nets are rather uninteresting.

In any event our unit is not yet making a distinction. In the discrete model the output function is usually





                                               xi = 1   if neti > θi
                                               xi = 0   if neti ≤ θi                                                 (3)

where θi is the threshold, a parameter associated with the unit. However, this creates discontinuities of the
derivatives and so we usually smooth the output function and write

                                               xi = f(neti)                                                          (4)

In the linear case f is some sort of sigmoidal function. For our ellipsoidal example Gaussian smoothing might be
suitable, i.e. f(x) = exp(-x2), so that the output is large (near one) when the input vector is near the centre of the
ellipsoid.
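The three output functions just mentioned can be sketched as follows (an illustrative fragment of our own, using the logistic function as the sigmoid):

```python
import math

def step(net, theta):
    # Discontinuous threshold output of equation (3).
    return 1 if net > theta else 0

def sigmoid(net):
    # A smooth sigmoidal output function, a common choice for f in equation (4).
    return 1.0 / (1.0 + math.exp(-net))

def gaussian(net):
    # Gaussian smoothing f(x) = exp(-x^2) for the ellipsoidal unit: the output
    # is large (near one) when the input vector is near the ellipsoid's centre.
    return math.exp(-net ** 2)

print(step(0.7, theta=0.5), sigmoid(0.0), gaussian(0.0))   # 1 0.5 1.0
```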

Sometimes the output function is stochastic so that the output of the unit depends in a probabilistic fashion on neti.

For an individual unit the sequence of events in operational mode (not learning) is

           1. Combine inputs to produce activation neti(t).
           2. Compute value of output xi = f(neti).
           3. Place outputs, based on new activation level, on output links (available from t+1 onward).

Changing the processing or knowledge structure in a connectionist model involves modifying the patterns of
interconnections or parameters associated with each unit. This is accomplished by modifying pi1,...,pik (or the wij
in the usual model) through experience using a learning rule.

Virtually all learning rules are based on some variant of a Hebbian principle (discussed in the next section) which
is invariably derived mathematically through some form of gradient descent. For example, the Delta or
Widrow-Hoff rule. Here modification of weights is proportional to the difference between the actual activation
achieved and the target activation provided by a teacher

                                               Δwij = η (ti(t) - neti(t)) xj(t),

where η > 0 is constant. This is a generalization of the Perceptron learning rule and is all very well provided we
know the desired values of ti(t).
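A minimal sketch of the Delta rule (the learning rate and target value here are illustrative choices of our own):

```python
import numpy as np

def delta_update(w, x, t, eta=0.1):
    # Delta (Widrow-Hoff) rule: delta w_ij = eta * (t_i - net_i) * x_j,
    # where net_i = w.x and t_i is the target activation supplied by a teacher.
    net = np.dot(w, x)
    return w + eta * (t - net) * x

w = np.zeros(3)
x = np.array([1.0, 0.0, 1.0])
for _ in range(100):                 # repeated presentations shrink the error
    w = delta_update(w, x, t=1.0)
print(np.dot(w, x))                  # the activation approaches the target 1.0
```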

Hebbian learning.

Donald O. Hebb's book The Organization of Behavior (1949) is famous among neural modelers because it
contained the first explicit statement of the physiological learning rule for synaptic modification that has since
become known as the Hebb synapse:

           Hebb rule. When an axon of a cell A is near enough to excite a cell B and repeatedly or persistently takes
           part in firing it, some growth process or metabolic change takes place in one or both cells such that A's
           efficiency, as one of the cells firing B, is increased.

The physiological basis for this synaptic potentiation is now understood more clearly [Brown 1988]. Hebb's
introduction to the book also contains the first use of the word 'connectionism' in the context of neural modeling.
The Hebb rule is not a mathematical statement, though it is close to one. For example, Hebb does not discuss the
various possible ways inhibition might enter the picture, or the quantitative learning rule that is being followed.
This has meant that a number of quite different learning rules can legitimately be called 'Hebbian rules'. We shall
see later that nearly all such learning rules bear a close mathematical relationship to the idea of `gradient descent',
which roughly means that if we wish to move to the lowest point of some error surface a good heuristic is: we
should always tend to go `downhill'. However, for the present chapter we shall conceptualise the Hebb rule in terms
of autocorrelations, i.e. the internal correlations between each pair of components of the pattern vectors we wish



the system to memorise.

Hebb was keenly aware of the `distributed' nature of the representation he assumed the nervous system uses; that
to represent something assemblies of many cells are required and that an individual cell may be a participant
member of many representations at different times. He postulated the formation of cell assemblies representing
learned patterns of activity.



The need for machine learning.

Why do we need to discover how to get machines to learn? After all, is it not the case that the most practical
developments in Artificial Intelligence, such as Expert Systems, have emerged from the development of advanced
symbolic programming languages such as LISP or Prolog? Indeed, this is so. But there are convincing arguments
[Bock 1985] which suggest that the technique of simulating human skills using symbolic programs cannot hope,
in the long run, to satisfy the principal goals of AI. Mainly these centre around the time it would take to figure out
the rules and write the software. But first we should consider the evolution of hardware.

How can one measure the overall computational power of an information processing system? There are two obvious
aspects we should consider. Firstly, information storage capacity - a system cannot be very smart if it has little or
no memory. On the other hand, a system may have a vast memory but little or no capacity to manipulate
information; so a second essential measure is the number of binary operations per second. On these two scales
Figure 1-2 illustrates the information processing capability of some familiar biological and technological
information processing systems. In the case of the biological systems these estimates are based on connectionist
models and may be excessively conservative.

We consider each axis independently. As we saw earlier, research in neurophysiology has revealed that the brain
and central nervous system consists of about 10^11 individual parallel processors, called neurons. Each neuron has
roughly 10^4 synaptic connections and if we allow only 1 bit per synapse then each neuron is capable of storing
about 10^4 bits of information. The information capacity of the brain is thus about 10^15 bits. Much of this
information is probably redundant but using this figure as a conservative estimate let us consider when we might
expect to have high-speed memories of 10^15 bits.








      Figure 1-2 Information processing capability [From: Mind Children, Hans Moravec].

Figure 1-3 shows that the amount of high-speed random access memory that may be conventionally accessed by
a large computer has increased by an order of magnitude every six years. If we can trust this simple extrapolation,
in generation thirteen, AD 2024-30, the average high speed memory capacity of a large computer will reach
10^15 bits.

Now consider the evolution of technological processing power. Remarkably, this follows much the same trend.
Of course, the real trick is putting the two together to achieve the desired result; it seems relatively unlikely that
we shall be in a position to accomplish this by 2024.

       Figure 1-3 Storage capacity.

So much for the hardware. Now consider the software. Even adult human brains are not filled to capacity. So we
will assume that 10% of the total capacity, i.e. 10^14 bits, is the extent of the `software' base of an adult human
brain. How long will it take to write the programs to fill 10^14 bits (production rules, knowledge bases etc.)? The
currently accepted rate of production of software, from conception through testing, de-bugging and documentation
to installation, is about one line of code per hour. Assuming, generously, that an average line of code contains
approximately 60 characters, or 500 bits, we discover that the project will require 100 million person years! We'll
never get anywhere by trying to program human intelligence into a machine.
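The arithmetic behind the 100-million-person-year figure can be checked directly (the 2000-hour person-year is our own assumption; the other numbers are from the text):

```python
# Bock's back-of-envelope estimate: how long to hand-write 10^14 bits of software?
bits_needed    = 1e14
bits_per_line  = 500       # ~60 characters, or 500 bits, per line of code
lines_per_hour = 1         # from conception through testing to installation
hours_per_year = 2000      # assumed working hours in one person-year

lines = bits_needed / bits_per_line                 # 2 x 10^11 lines of code
person_years = lines / lines_per_hour / hours_per_year
print(person_years)                                 # 1e8, i.e. 100 million person years
```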

What other options are available? One is direct transfer from the human brain to the machine. Considering
conventional transfer rates over a high speed bus this would take about 12 days. The only problem is: nobody has
the slightest idea how to build such a device.

What's left? In the biological world intelligence is acquired every day, therefore there must be another alternative.
Every day babies are born and in the course of time acquire a full spectrum of intelligence. How do they do it? The
answer, of course, is that they learn.

If we assume that the eyes, our major source of sensory input, receive information at the rate of about 250,000 bits
per second, we can fill the 10^14 bits of our machine's memory capacity in about 20 years. Now storing sensory
input is not the same thing as developing intelligence, however this figure is in the right ball park. Maybe what we
must do is connect our machine brain to a large number of high-data-rate sensors, endow it with a comparatively
simple algorithm for self organization, provide it with a continuous and varied stream of stimuli and evaluations
for its responses, and let it learn.
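The 20-year figure can likewise be checked; with an assumed 16 waking hours of sensory input per day (our assumption) the estimate lands close to the figure quoted:

```python
# Rough check of the sensory-learning estimate: fill 10^14 bits at 250,000 bits/s.
bits_needed  = 1e14
bits_per_sec = 250_000
waking_hours = 16            # assumed hours of sensory input per day

seconds = bits_needed / bits_per_sec                # 4 x 10^8 seconds of input
years = seconds / (waking_hours * 3600 * 365)
print(round(years))                                 # roughly 19 years
```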

This argument may seem cavalier in some aspects. The human brain is highly parallel and somewhat
inhomogeneous in its architecture. It does not clock at high serial speeds and does not access RAM to recall
information for processing. The storage capacity may be vastly greater than the 10^15 bits estimated by Sagan, since
each neuron is connected to as many as 10,000 others and the structure of these interconnections may also store
information. Indeed, although we do not know a great deal about the mechanisms of human memory, we do know
that it is multi-levelled with partial bio-chemical storage. However, none of this invalidates Bock's point that
programming can never be a substitute for learning.








                                              II Genetic Algorithms



Introduction.

The idea that the process of evolutionary search might be used as a model for hard combinatoric search algorithms
developed significantly in the mid 1960's. Evolutionary algorithms fall into the class of probabilistic heuristic
algorithms which one might use to attack NP-complete or NP-hard problems (see, for example [Horowitz 1978],
Chapters 11 and 12), such as the Travelling Salesman/person Problem (TSP). Of course, many of these problems
have significant applications in engineering hardware or software design and commercial optimisation problems,
but the underlying motivation for the study of evolutionary algorithms is principally to try to gain insight into the
evolutionary process itself.

Variously known as genetic algorithms, the phrase coined by the US school stemming from the work of John
Holland [Holland 1975], evolutionary programming, originally developed by L. J. Fogel, A. J. Owens and M. J.
Walsh, again in the US, and Evolutionsstrategie, as studied in Germany at around the same time by I. Rechenberg
and H-P. Schwefel [Schwefel 1965], the subject has exploded over the last 15 years. Curiously, the European and
US schools seemed largely unaware of each other's existence for quite some while.

Evolutionary algorithms have been applied to a variety of problems and offer intriguing possibilities for general
purpose adaptive search algorithms in artificial intelligence, especially, but not necessarily, for situations where
it is difficult or impossible to precisely model the external circumstances faced by the program. Search based on
evolutionary models had, of course, been tried before Holland's introduction of genetic algorithms. However, these
models were based on mutation and not notably successful. The principal difference of the more modern research
is an emphasis on the power of natural selection and the incorporation of a ‘crossover’ operator to mimic the effect
of sexual reproduction.

Two rather different types of theoretical analysis have developed for evolutionary algorithms: the classical
approach stemming from the original work of Mendel on heritability and the later statistical work of Galton and
Pearson at the end of the last century, and the Schema theory approach developed by Holland.

Mendel constructed a chance model of heritability involving what are now called genes. He conjectured the
existence of genes by pure reasoning - he never saw any. Galton and Pearson found striking statistical regularities
in heritability in large populations, for example, on average a son is halfway between his father's height and the
overall average height for sons. They also invented many of the statistical tools in use today such as the scatter
diagram, regression and correlation (see, for example, [Freedman 1991]). Around 1920 Fisher, Wright and
Haldane more or less simultaneously recognised the need to recast Darwinian theory as described by Galton and
Pearson in Mendelian terms. They succeeded in this task, and more recently Price's Covariance and Selection
Theorem [Price 1970], [Price 1972], an elaboration of these ideas, has provided a useful tool for algorithm analysis.

The archetypal GA.

In Nature each gene has several forms or alternatives - alleles - producing differences in the set of characteristics
associated with that gene, e.g. certain strains of garden pea have a single gene which determines blossom colour,
one allele causing the blossom to be white, the other pink. There are tens of thousands of genes in the
chromosomes of a typical vertebrate, each of which, on the available evidence, has several alleles. Hence the set
of chromosomes attained by taking all possible combinations of alleles contains on the order of 10 to the 3,000
structures for a typical vertebrate species. Even a very large population, say 10 billion individuals, contains only
a minuscule fraction of the possibilities.



A further complication is that alleles interact so that adaptation becomes primarily the search for co-adapted sets
of alleles. In the environment against which the organism is tested any individual exemplifies a large number of
possible `patterns of co-adapted alleles' or schema, as Holland calls them. In testing this individual we shall see
that all schema of which the individual is an instantiation are also tested. If the rules whereby genes are combined
have a tendency to generate new instances of above average schema then the resulting adaptive system has a high
degree of `intrinsic parallelism'1 which accelerates the evolutionary process. Considerations of this type offer an
explanation of how evolution can proceed at all. If a simple enumerative plan were employed and if 10 to the 12
structures could be tried every second it would take a time vastly exceeding the estimated age of the universe to
test 10 to the 100 structures.

The basic idea of an evolutionary algorithm is illustrated in Figure 2-1.




                 Figure 2-1 Generic model for a genetic algorithm.
                 INITIALISE: create initial population; evaluate fitness of each member.
                 INTERNAL:   create children from the existing population using genetic
                             operators; substitute children in the population, deleting
                             an equivalent number.
                 EXTERNAL:   evaluate the fitness of the children.


We seek to optimise members of a population of ‘structures’. These structures are encoded in some manner by a
‘gene string’. The population is then ‘evolved’ in a very stylised version of the evolutionary process.

We are given a set, A, of `structures' which we can think of, in the first instance, as being a set of strings of fixed
length l. The object of the adaptive search is to find a structure which performs well in terms of a measure of
performance v : A → ℝ+, where ℝ+ denotes the positive real numbers.


   1
     The notion of 'intrinsic parallelism' will be discussed but it should be mentioned that it has nothing to do with parallelism in the sense
normally intended in computing.


The programmer must provide a representation for the structures to be optimised. In the terminology of genetic
algorithms a particular structure is called a phenotype and its representation as a string is called a chromosome
or genotype. Usually this representation consists of a fixed length string in which each component, or gene, may
take only a small range of values, or alleles. In this context `small' often means two, so that binary strings are used
for the genotypes.

There is nothing obligatory in taking a one-bit range for each allele but there are theoretical reasons to prefer
few-alleles-at-many-sites over many-alleles-at-few-sites (the arguments have been given by [Holland 1975] (p. 71)
and [Smith 1980] (p. 56), and supporting evidence for the correctness of these arguments has been presented by
[Schaffer 1984] (p. 107)).




                1. Randomly generate a population of M structures

                                                  S(0) = {s(1,0),...,s(M,0)}.

                2. For each new string s(i,t) in S(t), compute and save its measure of utility v(s(i,t)).

                3. For each s(i,t) in S(t) compute the selection probability defined by

                          p(i,t) = v(s(i,t)) / Σi v(s(i,t)).

                4. Generate a new population S(t+1) by selecting structures from S(t) via the selection
                probability distribution and applying the idealised genetic operators to the structures
                generated.

                5. Goto 2.


   Algorithm 2-1 Archetypal genetic algorithm.
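Algorithm 2-1 can be sketched in a few lines of Python (illustrative only; the course demonstrator is the Mathematica program GA_Simple.nb). The `one-max' goal function and all parameter values here are our own choices:

```python
import random

def archetypal_ga(v, l=20, M=50, generations=100, p_mut=0.01):
    # Algorithm 2-1: fitness-proportional selection with one-point crossover
    # and mutation. v maps a bit-string (list of 0/1) to a positive fitness.
    pop = [[random.randint(0, 1) for _ in range(l)] for _ in range(M)]   # step 1
    for _ in range(generations):
        fit = [v(s) for s in pop]                                       # step 2
        total = sum(fit)
        probs = [f / total for f in fit]                                # step 3
        new_pop = []
        while len(new_pop) < M:                                         # step 4
            a, b = random.choices(pop, weights=probs, k=2)  # select via p(i,t)
            x = random.randrange(1, l)                      # one-point crossover
            child = a[:x] + b[x:]
            child = [1 - g if random.random() < p_mut else g for g in child]
            new_pop.append(child)
        pop = new_pop                                                   # step 5
    return max(pop, key=v)

# `One-max' goal function: fitness = number of 1s (plus 1 to keep it positive).
best = archetypal_ga(lambda s: sum(s) + 1)
print(sum(best))   # typically close to the optimum of 20
```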

The function v provides a measure of ‘fitness’ for a given phenotype and (since the programmer must also supply
a mapping from the set of genotypes to the set of phenotypes) hence for a given genotype. Given a particular
            n
genotype or string the goal function provides a means for calculating the probability that the string will be selected
to contribute to the next generation. It should be noted that the composition function v( ) mapping genotypes to
                                                                                              n
fitness is invariably discontinuous, nevertheless genetic algorithms cope remarkably well with this difficulty.

The basis of Darwinian evolution is the idea of natural selection i.e. population genetics tends to use the

         Selection Principle. The fitness of an individual is proportional to the probability that it will
         reproduce effectively.2

In genetic algorithm design we tend to apply this in the converse form: the probability that an individual will
reproduce is proportional to its fitness. ‘Fit’ strings, i.e. strings having larger goal function values, will be more
likely to be selected but all members of the population will have some chance to contribute.



    2
       Obfuscation of the definition of ‘fitness’ occurs frequently in the classical literature. The reasons are not
difficult to understand. Both Darwin and Fisher found it hard to swallow that the lower classes bred more
prolifically and were therefore, by definition, ‘fitter’ than their ‘social superiors’. This confusion regarding ‘fitness’
still occurs in the GA’s literature for different reasons.


The box contains a sketch of the standard serial style genetic algorithm. Typically the evaluation of the goal
function for a particular phenotype, a process which strictly speaking is external to the genetic algorithm itself,
is the most time consuming aspect of the computation.

Given the mapping from genotype to phenotype, the goal function, and an initial random population the genetic
algorithm proceeds to create new members of the population (which progressively replace the old members) using
genetic operators, typically mutation, crossover and inversion, modelled on their biological analogs.

For the moment we represent strings as a1a2a3...al [ai = 1 or 0].

Using this notation we can describe the operators by which strings are combined to produce new strings. It is the
choice of these operators which produces a search strategy that exploits co-adapted sets of structural components
already discovered. Holland uses three such principal operators: Crossover, Mutation and Inversion (which we
shall not discuss in detail here).

Crossover. In crossover one or more cut points are selected at random and the operation illustrated in Figure 2-2,
Figure 7-1 (where two cut points are employed) is used to create two children. A variety of control regimes are
possible, but a simple strategy might be `select one of the children at random to go into the next generation'.
Children tend to be `like' their parents, so that crossover can be considered as a focussing operator which exploits
knowledge already gained; its effects are quite quickly apparent.

       Figure 2-2 Standard genetic operators.
       CROSSOVER (two cut points):  Parent 1  1011 010011 10111    Child 1  1100 010011 11010
                                    Parent 2  1100 111000 11010    Child 2  1011 111000 10111
       MUTATION:    110011100011010  ->  111011101011010
       INVERSION:   111111100011010  ->  110011111011010

Crossing over proceeds in three steps.


         a) Two structures a1...al and b1...bl are selected at random from the current population.

         b) A crossover point x, in the range 1 to l-1 is selected, again at random.

         c) Two new structures

                                                  a1a2...axbx+1bx+2...bl
                                                  b1b2...bxax+1ax+2...al
         are formed.
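Steps (a)-(c) translate directly into code (a sketch of our own, operating on strings of '0'/'1' characters):

```python
import random

def one_point_crossover(a, b):
    # Steps (b)-(c): select a cut point x in the range 1 to l-1 at random,
    # then exchange the tails of the two structures.
    l = len(a)
    x = random.randint(1, l - 1)
    return a[:x] + b[x:], b[:x] + a[x:]

a, b = "10110100", "11001110"
c1, c2 = one_point_crossover(a, b)
print(c1, c2)   # e.g. '10101110' and '11010100' when x = 3
```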

In modifying the pool of schema (discussed below), crossing over continually introduces new schema for trial
whilst testing extant schema in new contexts. It can be shown that each crossing over affects a great number of
schema.




There is large variation in the crossover operators which have been used by different experimenters. For example,
it is possible to cross at more than one point. The extreme case of this is where each allele is randomly selected
from one or other parent string with uniform probability - this is called uniform crossover. Although some writers
have argued in favour of uniform crossover, there would seem to be theoretical arguments against its use viz. if
evolution is the search for co-adapted sets of alleles then this search is likely to be severely undermined if many
cut points are used. In language we shall develop shortly: the probability of schema disruption when using uniform
crossover is much higher than when using one or two point crossover.
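The disruption argument can be checked with a small simulation (a construction of our own): for a schema whose two defining positions i and j lie close together, one-point crossover rarely cuts between them, whereas uniform crossover assigns the two loci to different parents half the time regardless of their spacing:

```python
import random

def disrupted_one_point(l, i, j):
    # One-point crossover separates loci i < j iff the cut falls between them.
    x = random.randint(1, l - 1)
    return i < x <= j

def disrupted_uniform():
    # Under uniform crossover two loci come from different parents
    # with probability 1/2, whatever their defining length.
    return random.random() < 0.5

l, i, j, n = 16, 3, 5, 100_000
p1 = sum(disrupted_one_point(l, i, j) for _ in range(n)) / n
pu = sum(disrupted_uniform() for _ in range(n)) / n
print(p1, pu)   # roughly 0.13 versus 0.5
```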

The design of the crossover operator is strongly influenced by the nature of the representation. For example, if the
problem is the TSP and the representation of a tour is a straightforward list of cities in the order in which they are
to be visited then a simple crossover operator will, in general, not produce a tour. In this case the options are:

         !   Change the representation.

         !   Modify the crossover operator.

or       !   Effect ‘genetic repair’ on non-tours which may result.

There is obviously much scope for experiment for any particular problem. The danger is that the resulting
algorithm may be so far removed from the canonical form that the correlation between parental and child fitness
may be small - in which case the whole justification for the method will have been lost.

Mutation. In mutation an allele is altered at each site with some fixed probability. Mutation disperses the
population throughout the search space and so might be considered as an information gathering or exploration
operator. Search by mutation is a slow process analogous to exhaustive search. Thus mutation is a ‘background’
operator, assuring that the crossover operator has a full range of alleles so that the adaptive plan is not trapped on
local optima.

Each structure a1a2...al in the population is operated upon as follows. Position x is modified, with probability p
independent of the other positions, so that the string is replaced by

                                                 a1a2...ax-1 z ax+1...al

where z is drawn at random from the possible allele values. If p is the probability of mutation at a single position
then the number h of mutations in a given string of length l is binomially distributed and, for small p, is well
approximated by a Poisson distribution with parameter lp.
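A sketch of the mutation operator in Python (illustrative only; the alphabet and rate are arbitrary). Note that because z is drawn at random from all possible values, a mutation event leaves the allele unchanged with probability 1/k for a k-letter alphabet.

```python
import random

def mutate(string, alphabet, p, rng=random):
    # Independently at each site, with probability p, redraw the allele
    # at random from the alphabet (the redraw may repeat the old value).
    return [rng.choice(alphabet) if rng.random() < p else a for a in string]
```

For a binary alphabet the expected number of sites actually changed is therefore l*p/2, not l*p.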

A simple demonstrator is given in the Mathematica program GA_Simple.nb. A more complicated GA using
Inversion is given in GA_Inversion.nb.

Design issues - what do you want the algorithm to do?

Now we have to ask just what it is we want of a genetic algorithm. There are several, sometimes mutually
exclusive, possibilities. For example:

         !   Rapid convergence to a global optimum.

         !   Produce a diverse population of near optimal solutions in different ‘niches’.

         !   Be adaptive in ‘real-time’ to changes in the goal function.

We shall deal with each of these in turn but first let us briefly consider the nature of the search space. If the space
is flat with just one spike then no algorithm short of exhaustive search will suffice. If the space is smooth and
unimodal then a conventional hill-climbing technique should be used.


                                                          18
Antonia J. Jones: 6 November 2005

Somewhere between these two extremes are problems in which the goal function is a highly non-linear multi-
modal function of the gene values - these are the problems of hard combinatoric search for which some style of
genetic algorithm may be appropriate.

Rapid convergence to a global optimum.

Of course this requirement is rather simplistic. Holland's theory holds for large populations. However, in many AI
applications it is computationally infeasible to use large populations, and this in turn leads to a problem commonly
referred to in the genetic algorithms literature as Premature Convergence (to a sub-optimal solution) or Loss of Diversity.
When this occurs the population tends to become dominated by one relatively good solution and locked into a sub-
optimal region of the search space. For small populations the schema theorem is actually an explanation for
premature convergence (i.e. the failure of the algorithm) rather than a result which explains success.

Premature convergence is related to a phenomenon observed in Nature. Allelic frequencies may fluctuate purely
by chance about their mean from one generation to another; this is termed Random Genetic Drift. Its effect on the
gene pool in a large population is negligible, but in a small effectively interbreeding population, chance alteration
in Mendelian ratios can have a significant effect on gene frequencies and can lead to the fixation of one allele and
loss of another. For example, isolated communities within a given population have been found to have frequencies
for blood group alleles different from the population as a whole. Figure 2-3 illustrates this phenomenon with a
simple function optimisation genetic algorithm.

The inexperienced often tend to attempt to counteract premature convergence by increasing the rate of mutation.
However, this is not a good idea.

         !   A high rate of mutation tends to devalue the role of crossover in building co-adapted sets of
         alleles and in essence pushes the algorithm in the direction of exhaustive search. Whilst some
         mutation is necessary, a high rate of mutation is invariably counter-productive.

In trying to counteract premature convergence we are essentially trying to balance the exploitation of good
solutions found so far against the exploration which is required to find hitherto unknown promising regions of
the search space. It is worth observing that, in computational terms, any algorithm which often inserts copies of
strings into the current population is wasteful. This is true for the Traditional Genetic Algorithm (TGA)
outlined as 2, 7-1.

Figure 2-3 Premature convergence - no sharing.

Produce a diverse population of near optimal solutions in different `niches'.

The problem of premature convergence has been addressed by a number of authors using a diversity of techniques.
Many of the papers in [Davis 1987] contain discussions of precisely this point. The methods used to combat
premature convergence in TGAs are not necessarily appropriate to the parallel formulations of genetic algorithms
(PGAs) which we shall discuss shortly.

Cavicchio, in his doctoral dissertation, suggested a preselection mechanism as a means of promoting genotype
diversity. Preselection filters children generated, possibly picking the fittest, and replaces parent members of the
population with their offspring [Cavicchio 1970].




De Jong's crowding scheme is an elaboration of the preselection mechanism. In the crowding scheme, an offspring
replaces the most similar string from a randomly drawn subpopulation having size CF (the crowding factor) of the
current population. Thus a member of the population experiences a selection pressure in proportion to its similarity
to other members of the population [De Jong 1975]. Empirical determination of CF with a five function test bed
determined CF = 3 as optimal.
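A minimal sketch of the crowding replacement step (our own Python illustration; the similarity measure here is Hamming distance over bitstrings):

```python
import random

def hamming(a, b):
    # Genotype similarity: number of loci at which two bitstrings differ.
    return sum(x != y for x, y in zip(a, b))

def crowding_replace(population, offspring, CF=3, rng=random):
    # De Jong's crowding: the offspring replaces the most similar member
    # of a subpopulation of size CF drawn at random from the population.
    sample = rng.sample(range(len(population)), CF)
    victim = min(sample, key=lambda i: hamming(population[i], offspring))
    population[victim] = offspring
```

Because similar strings displace one another, a cluster of near-identical genotypes cannot easily take over the whole population.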

Booker implemented a sharing method in a classifier system environment which used the bucket brigade algorithm
[Booker 1982]. The idea here was that if related rules share payments then sub-populations of rules will form
naturally. However, it seems difficult to apply this mechanism to standard genetic algorithms. Schaffer has
extended the idea of sub-populations in his VEGA model in which each fitness element has its own sub-population
[Schaffer 1984].

A different approach to help maintain genotype diversity was introduced by Mauldin via his uniqueness operator
[Mauldin 1984]. The uniqueness operator helped to maintain diversity by incorporating a `censorship' operator
in which the insertion of an offspring into the population is possible only if the offspring is genotypically different
from all members of the population at a number of specified genotypical loci.

* Results and methods related to the TSP.

We digress briefly to give a little more detailed background material on the TSP. The question is often asked: if
one cannot exactly solve any very large TSP problem (except in special cases; at present `very large' means a
problem involving more than a thousand cities), how can one know how accurate a solution produced by a
probabilistic or heuristic algorithm actually is?

The best exact solution methods for the travelling salesman problem are capable of solving problems of several
hundred cities [Grötschel 1991], but unfortunately excessive amounts of computer time are used in the process and,
as N increases, any exact solution method rapidly becomes impractical. For large problems we therefore have no
way of knowing the exact solution, but in order to gauge the solution quality of any algorithm we need a reasonably
accurate estimate of the minimal tour length. This is usually provided in one of two ways.

For a uniform distribution of cities the classic work by Beardwood, Halton and Hammersley (BHH) [Beardwood
1959] obtains an asymptotic best possible upper bound for the minimum tour length for large N. Let {Xi}, 1 <= i <
infinity, be independent random variables uniformly distributed over the unit square, and let LN denote the length of the
shortest closed path which connects all the elements of {X1,...,XN}. In the case of the unit square they proved, for
example, that there is a constant c > 0 such that, with probability 1,

                                                 lim_{N -> infinity} LN N^(-1/2) = c                                   (1)

In general c depends on the geometry of the region considered.

One can use the estimate provided by the BHH theorem in the following form: the expected length LN* of a minimal
tour for an N-city problem, in which the cities are uniformly distributed in a square region of the Euclidean plane,
is given by

                                                     LN* ~ c2 sqrt(NR)                                           (2)

where R is the area of the square and the constant c2 (for historical reasons known as Stein's constant - [Stein 1977])
has recently been estimated as c2 = 0.70805 ± 0.00007 by Johnson, McGeoch and Rothberg [Johnson 1996].
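The estimate is trivial to apply in practice; a minimal sketch using the constant quoted above:

```python
import math

def bhh_estimate(n, area, c2=0.70805):
    # Expected minimal tour length for n cities distributed uniformly
    # over a square region of the given area (BHH/Stein estimate).
    return c2 * math.sqrt(n * area)
```

For 10,000 cities on the unit square this gives about 70.8; note that quadrupling the number of cities only doubles the expected tour length.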

A second possibility would be to use a problem specific estimate of the minimal tour length which gives a very
accurate estimate: the Held-Karp lower bound [Held 1970], [Held 1971]. Computing the Held-Karp lower bound
is an iterative process involving the evaluation of Minimal Spanning Trees for N-1 cities of the TSP followed by
Lagrangean relaxations, see [Valenzuela 1997].



If one seeks approximate solutions then various algorithms based on simple rule based heuristics (e.g. nearest
neighbour and greedy heuristics), or local search tour improvement heuristics (e.g. 2-Opt, 3-Opt and Lin-
Kernighan), can produce good quality solutions much faster than exact methods. A combinatorial local search
algorithm is built around a `combinatoric neighbourhood search' procedure, which given a tour, examines all tours
which are closely related to it and finds a shorter `neighbouring' tour, if one exists. Algorithms of this type are
discussed in [Papadimitriou 1982]. The definition of `closely related' varies with the details of the particular local
search heuristic.

The particularly successful combinatorial local search heuristic described by Lin and Kernighan [Lin 1973] defines
`neighbours' of a tour to be those tours which can be obtained from it by doing a limited number of interchanges
of tour edges with non-tour edges. The slickest local heuristic algorithms3, which on average tend to have
complexity O(n^a) for a > 2, can produce solutions with approximately 1-2% excess for 1000 cities in a few
minutes. However, for 10,000 cities the time escalates rapidly and one might expect that the solution quality also
degrades, see [Gorges-Schleuter 1990], p 101.
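A bare-bones 2-Opt sketch in Python (illustrative only; serious implementations use neighbour lists rather than rescanning all edge pairs). A 2-change replaces tour edges (a,b) and (c,d) with (a,c) and (b,d), reversing the intervening segment:

```python
import math
import random

def tour_length(tour, pts):
    # Total Euclidean length of the closed tour over the points pts.
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def two_opt(tour, pts):
    # Keep applying improving 2-changes until none remains, i.e. until
    # the tour is a 2-Opt local optimum.
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n if i > 0 else n - 1):
                a, b = tour[i], tour[i + 1]
                c, d = tour[j], tour[(j + 1) % n]
                # Replace edges (a,b),(c,d) by (a,c),(b,d) if strictly shorter.
                if (math.dist(pts[a], pts[c]) + math.dist(pts[b], pts[d])
                        < math.dist(pts[a], pts[b]) + math.dist(pts[c], pts[d]) - 1e-12):
                    tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
                    improved = True
    return tour
```

In the Euclidean plane every self-crossing tour admits an improving 2-change, so a 2-Opt local optimum is always crossing-free.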

An approximation scheme A is an algorithm which given problem instance I and e > 0 returns a solution of length
A(I, e) such that

                                           |A(I, e) - Ln(I)| / Ln(I) <= e                        (3)

where Ln(I) denotes the length of a minimal tour for instance I. Such an approximation scheme is called a fully
polynomial time approximation scheme if its run time is bounded by a function that is polynomial in both the
instance size and 1/e. Unfortunately the following theorem holds, see for example [Lawler 1985], p165-166.

Theorem. If P =/= NP then there can be no fully polynomial time approximation scheme for the TSP, even if
instances are restricted to points in the plane under the Euclidean metric.

Although the possibility of a fully polynomial time approximation scheme is effectively ruled out, there remains
the possibility of an approximation scheme that, although it is not polynomial in 1/e, does have a running time
which is polynomial in n for every fixed e > 0. The Karp algorithms, based on cellular dissection, provide
`probabilistic' approximation schemes for the geometric TSP.

Theorem [Karp 1977]. For every e > 0 there is an algorithm A(e) such that A(e) runs in time C(e)n + O(nlogn)
and, with probability 1, A(e) produces a tour of length not more than (1+e) times the length of a minimal tour.

The Karp-Steele algorithms [Steele 1986] can in principle converge in probability to near optimal tours very
rapidly. Cellular dissection is a form of divide and conquer. Karp's algorithms partition the region R into small
subregions, each containing about t cities. An exact or heuristic method is then applied to each subproblem and
the resulting sub-tours are finally patched together to yield a tour through all the cities.

Evolutionary Divide and Conquer.

Until recently the best genetic algorithms designed for TSP problems have used permutation crossovers for
example [Davis 1985], [Goldberg 1985], [Smith 1985], or edge recombination operators [Whitley 1989], and
required massive computing power to gain very good approximate solutions (often actually optimal) to problems
with a few hundred cities [Gorges-Schleuter 1990]. Gorges-Schleuter cleverly exploited the architecture of a
transputer bank to define a topology on the population and introduce local mating schemes which enabled her to
delay the onset of premature convergence. However, this improvement to the genetic algorithm is independent of



   3
     The most impressive results in this direction are due to David Johnson at AT&T Bell Laboratories - mostly reported in unpublished
Workshop presentations.


any limitations inherent in permutation crossovers. Eventually, for problems of more than around 1000 cities, all
such genetic algorithms tend to produce a flat graph of improvement against number of individuals tested, no
matter how long they are run.

Thus experience with genetic algorithms using permutation operators applied to the Geometric Travelling
Salesman Problem (TSP) suggests that these algorithms fail in two respects when applied to very large problems:
they scale rather poorly as the number of cities n increases, and the solution quality degrades rapidly as the problem
size increases much above 1000 cities. An interesting novel approach developed by Valenzuela and Jones
[Valenzuela 1994] which seeks to circumvent these problems is based on the idea of using the genetic algorithm
to explore the space of problem subdivisions, rather than the space of solutions itself.

This alternative method, for genetic algorithms applied to hard combinatoric search, can be described as
Evolutionary Divide and Conquer (EDAC), and the approach has potential for any search problem in which
knowledge of good solutions for subproblems can be exploited to improve the solution of the problem itself. As they
say
         ! Essentially we are suggesting that intrinsic parallelism is no substitute for divide and conquer in hard combinatoric search and
         we aim to have both. [Valenzuela 1994]

The goal was to develop a genetic algorithm capable of producing reasonable quality solutions for problems of
several thousand cities, and one which will scale well as the problem size n increases. `Scaling well' in this context
almost inevitably means a time complexity of O(n) or at worst O(nlogn). This is a fairly severe constraint: for
example, given a list of n city co-ordinates, the simple act of computing all possible edge lengths, an O(n^2)
operation, is excluded. Such an operation may be tolerable for n = 5000 but becomes intolerable for n = 100,000.

In the previous section we mentioned the Karp and Steele cellular dissection algorithms, and it is this technique
which is the basis of the Valenzuela-Jones EDAC genetic algorithms for the TSP.




          Figure 2-4 Solution to 50 City Problem using Karp's deterministic bisection method 1.





In practice a one-shot deterministic Karp algorithm yields rather poor solutions, typically 30% excess (with simple
patching) when applied to 500 - 1000 city problems. Nevertheless, the Karp technique is a good starting point
for exploring EDAC applied to the TSP. There are several reasons. First, according to Karp's theorem there is some
probabilistic asymptotic guarantee of solution quality as the problem size increases. Second, the time complexity is
about as good as one can hope for, namely O(nlogn). The run time of a genetic algorithm based on exploring the
space of `Karp-like' solutions will be proportional to nlogn multiplied by the number of times the Karp algorithm is
run, i.e. the number of individuals tested.

Karp's algorithm proceeds by partitioning the problem recursively from the top down. At each step the current
rectangle is bisected horizontally or vertically, according to a deterministic rule designed to keep the rectangle
perimeter minimal. This bisection proceeds until each subrectangle contains a preset maximum number of cities
t (typically t = 10). Each small subproblem is then solved
and the resulting subtours are patched together to produce a solution to the original problem - see Figure 2-4.

Figure 2-5 The EDAC (top) and simple 2-Opt (bottom) time complexity (log scales).
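The recursive dissection step can be sketched as follows (our own Python illustration; as a stand-in for Karp's perimeter rule it simply cuts at the median city across the longer side of the bounding rectangle):

```python
import random

def karp_partition(points, t=10):
    # Recursively bisect the cities at the median coordinate along the
    # longer side of their bounding rectangle, until each subproblem
    # holds at most t cities; the subtours are then solved and patched.
    if len(points) <= t:
        return [points]
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    axis = 0 if max(xs) - min(xs) >= max(ys) - min(ys) else 1
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return karp_partition(pts[:mid], t) + karp_partition(pts[mid:], t)
```

Since each level halves the city count, the recursion depth is O(log n) and the whole dissection costs O(nlogn) overall.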

In the EDAC algorithm the genotype is a p x p binary array in which a `1' or `0' indicates whether to cut
horizontally or vertically at the current bisection. If we maintain the subproblem size, t, and increase the number
of cities in the TSP, then a partition better than Karp's becomes progressively harder to find by randomly choosing
a horizontal or vertical bisection at each step. If the problem size is n = 2^k t, where 2^k is the number of subsquares,
then the corresponding genotype requires at least n/t - 1 bits. The size of the partition space is 2 to the power p^2,
which for p = 80 (the value used for n = 5000) is approximately exp(4436). For n = 5000 the size of the permutation
search space, roughly estimated using Stirling's formula, is around exp(37586). Thus searching partition space is
easier than searching permutation space and this provides a third argument in favour of exploring this
representation of problem subdivision as a genotype. We know from Karp's theorem that the class of tours
produced by dissection and patching will have representatives very close to the optimum tour, so by restricting
attention to this smaller set one is not `throwing out the baby with the bath-water', i.e. the set may be smaller but
it nevertheless contains near optimal tours.
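The comparison is easy to check numerically (a sketch; the log-Gamma function gives log n! without overflow, while the figure exp(37586) in the text comes from the cruder Stirling estimate n ln n - n):

```python
import math

p, n = 80, 5000
# Partition space: one bit per entry of the p x p binary array.
log_partition = p * p * math.log(2)
# Permutation space: log(n!) computed via the log-Gamma function.
log_permutation = math.lgamma(n + 1)
```

This gives log sizes of roughly 4436 and 37591 respectively: partition space, although astronomically large, is exponentially smaller than permutation space.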

This approach contrasts sharply with the idea of `broadcast languages' mooted in Chapter 8 of [Holland 1975], in
which techniques for searching the space of representations for a genetic algorithm are discussed. In general the
space of representations is vastly larger than the search space of the problem itself, but we have seen with the TSP
that this space is already so huge that it is impractical to search in any comprehensive fashion for all except the
smallest problems. Hence, it seems unlikely that replacing the original search space by an even larger one will turn
out to be a productive approach.








  Figure 2-6 The 200 generation EDAC 5% excess solution for a 5000 city problem.


In any event even the EDAC algorithm requires clever recursive repair techniques to improve the accuracy when
subtours are patched together. Nevertheless, the algorithm scales well. Figure 2-5 compares the EDAC algorithms
with simple 2-Opt (which gives an accuracy of around 8% excess). This version of the EDAC algorithm produces
solutions at the 5% level, see Figure 2-6, but a later more elaborate variant reliably produces solutions with around
1% excess and has been tested on problems sizes of up to 10,000 cities.

This technique probably represents the best that can be done at the present time using genetic algorithms for the
TSP. It is not yet practical by comparison with iterated Lin-Kernighan (or even 2-Opt)4, but it scales well and may
eventually offer a viable technique for obtaining good solutions to TSP problems involving several hundred
thousand cities.

Parallel EDACII and EDACIII were both tested on a range of problems between 500 and 5000 cities. Parental pairs
were chosen from the initial random population and the mid-parent value of the tour lengths calculated and
recorded. Crossover and mutation were then applied to each selected parental pair and the tour length evaluated
for the resulting offspring. Pearson's correlation coefficient, rxy, was calculated in each experiment and significance
tests based on Fisher's transformation carried out in order to establish whether the resulting correlation coefficients
differed significantly from zero (i.e. no correlation). Scatter diagrams in Figure 2-7 and Figure 2-8 illustrate the
Price correlation for parallel EDACII and EDACIII on the 5000 city problem.
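The significance test can be sketched as follows (an illustrative Python version, not the original analysis code): under H0 of zero correlation, Fisher's transformation z = atanh(r) is approximately normal with standard error 1/sqrt(n-3).

```python
import math

def pearson_r(x, y):
    # Sample correlation between mid-parent values x and offspring values y.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fisher_statistic(r, n):
    # atanh(r) * sqrt(n - 3) is approximately N(0, 1) when rho = 0, so
    # values beyond 1.96 reject `no correlation' at the 5% level.
    return math.atanh(r) * math.sqrt(n - 3)
```

A significantly positive statistic is what justifies the selection pressure: fit parents tend, on average, to produce fit offspring.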

Although the genotype used in these experiments was a binary array it could more naturally (at the cost of
complication in the coding) be represented by a pair of binary trees, or a quadtree. The use of trees here would be
more in keeping with the recursive construction of the phenotype from the genotype, a process analogous to



   4
     For example, wildly extrapolating the figures gives the breakeven point with 2-Opt at around n = 422,800 requiring some 74 cpu days!
Of course, other things would collapse before then.


growth, and it is possible to produce a modified Schema theorem for the case of trees, where the genetic
information is encoded in the shape of the tree and information placed at leaf nodes.



Figure 2-7 EDACII mid-parent vs offspring correlation for 5000 cities [Valenzuela 1995].

Figure 2-8 EDACIII mid-parent vs offspring correlation for 5000 cities [Valenzuela 1995].

In Nature very complex phenotypical structures frequently give the appearance of having been constructed
recursively from the genotype. Examples of recursive algorithms which lead to very natural looking graphical
representations of natural living structures such as trees, plants, and so on, can be found in the work of
Lindenmayer [Lindenmayer 1971] on what are now called L-systems. These production systems are very similar
to the production rules which define various kinds of context sensitive or context free grammars. The combination
of tree structured genotypes, or recursive construction algorithms similar to production rules, combined with the
divide-and-conquer paradigm suggest a powerful computational technique for the compression of complex
phenotypical structures into useful genotypical structures. So much so that, as our understanding of exactly how
DNA encodes the phenotypical structure of individual biological organisms (particularly the neural systems of
mammals) progresses, it would be surprising to find that Nature has not employed some such technique.
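A context-free (0L) rewriting step takes only a few lines (an illustrative sketch; the rule set is Lindenmayer's original algae model):

```python
def l_system(axiom, rules, steps):
    # 0L-system rewriting: every symbol is replaced in parallel by its
    # production at each step; symbols without a rule are copied unchanged.
    s = axiom
    for _ in range(steps):
        s = "".join(rules.get(ch, ch) for ch in s)
    return s

# Lindenmayer's algae model: A -> AB, B -> A.
algae = {"A": "AB", "B": "A"}
```

The string lengths grow as the Fibonacci numbers: a two-rule `genotype' compresses an unboundedly complex `phenotype', which is exactly the point made above.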

                                                              Chapter references


[Altenberg 1987] L. Altenberg and M. W. Feldman. Selection, generalised transmission, and the evolution of
modifier genes. The reduction principle. Genetics 117:559-572.

[Altenburg 1994] L. Altenberg. The Evolution of Evolvability in Genetic Programming. Chapter 3 in Advances
in Genetic Programming, Ed Kenneth E. Kinnear, Jr., MIT Press, 1994.

[Belew 1990] R. Belew, J. McInerney, and N. N. Schraudolph. Evolving networks: Using the genetic algorithm
with connectionist learning. CSE Technical Report CS90-174, University of California, San Diego, 1990.

[Booker 1982] L. B. Booker. Intelligent behaviour as an adaption to the task environment. Doctoral dissertation,
University of Michigan, 1982. Dissertation Abstracts International 43(2), 469B.

[Brandon 1990] R. N. Brandon. Adaptation and Environment, pages 83-84. Princeton University Press, 1990.

[Cavalli-Sforza 1976] L. L. Cavalli-Sforza and M. W. Feldman. Evolution of continuous variation: direct
approach through joint distribution of genotypes and phenotypes. Proceedings of the National Academy of Sciences
U.S.A., 73:1689-1692, 1976.




[Cavicchio 1970] D. J. Cavicchio. Adaptive search using simulated evolution. Doctoral dissertation, University
of Michigan (unpublished), 1970.

[Chalmers 1990] David J. Chalmers. The Evolution of Learning: An experiment in Genetic Connectionism.
Proceedings of the 1990 Connectionist Models Summer School, San Marco, CA. Morgan Kaufmann, 1990.

[Collins 1991] R. J. Collins and D. R. Jefferson. Selection in massively parallel genetic algorithms. Proceedings
of the fourth international conference on genetic algorithms. Morgan Kaufmann 1991.

[Davis 1987] Lawrence Davis, Editor. Genetic Algorithms and Simulated Annealing, Pitman Publishing, London.

[De Jong 1975] K. De Jong. An analysis of the behaviour of a class of genetic adaptive systems. Doctoral
dissertation, University of Michigan, 1975. Dissertation Abstracts International 36(10), 5140B.

[Freedman 1991] D. Freedman, R. Pisani, R. Purves and A. Adhikkari. Statistics, Second edition, W. W. Norton,
New York, 1991.


[Goldberg 1987] David E. Goldberg and Jon Richardson. Genetic Algorithms with Sharing for Multimodal
Function Optimization. Proc. Second Int. Conf. on Genetic Algorithms, pp. 41-49, MIT.

[Gorges-Schleuter 1990] Martina Gorges-Schleuter. Genetic Algorithms and Population Structures: A Massively
Parallel Algorithm. Ph.D. Thesis, University of Dortmund, August 1990.

[Grefenstette 1987] John J. Grefenstette. Incorporating Problem Specific Knowledge into Genetic Algorithms. In
Genetic Algorithms and Simulated Annealing, Ed. Lawrence Davis, Pitman Publishing, London.

[Holland 1975] John H. Holland. Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan
Press.

[Horowitz 1978] E. Horowitz and S. Sahni. Fundamentals of Computer Algorithms. London, Pitman Publishing
Ltd.

[Johnson 1996] D. S. Johnson, L. A. McGeoch and E. E. Rothberg. Asymptotic experimental analysis for the Held-
Karp traveling salesman bound. Proceedings 1996 ACM-SIAM Symposium on Discrete Algorithms, to appear.

[Jones 1993] Antonia J. Jones. Genetic Algorithms and their Applications to the Design of Neural Networks,
Neural Computing & Applications, 1(1):32-45, 1993.

[Koza 1992] John L. Koza. Genetic Programming: On the Programming of Computers by Means of Natural
Selection. Bradford Books, MIT Press, 1992. ISBN 0-262-11170-5.

[Lindenmayer 1971] A. Lindenmayer. Developmental systems without cellular interaction, their languages and
grammars. J. Theoretical Biology 30, 455-484, 1971.

[Lyubich 1992] Y. I. Lyubich. Mathematical Structures in Population Genetics. Springer-Verlag, New York, pages
291-306. 1992.

[Manderick 1989] B. Manderick and P. Spiessens. Fine grained parallel genetic algorithms. Proceedings of the
third international conference on genetic algorithms. Morgan Kaufmann, 1989.



[Manderick 1991] Manderick, B. de Weger, M. and Spiessens, P. The genetic algorithm and the structure of the
fitness landscape. In R. K. Belew and L. B. Booker, Editors, Proceedings of the Fourth International Conference
on Genetic Algorithms, pages 143-150, San Mateo CA, Morgan Kaufmann.

[Mauldin 1984] M. L. Mauldin. Maintaining diversity in genetic search. National Conference on Artificial
Intelligence, 247-250, 1984.

[Macfarlane 1993] D. Macfarlane and Antonia J. Jones. Comparing networks with differing neural-node functions
using Transputer based genetic algorithms. Neural Computing & Applications, 1(4): 256-267, 1993.

[Menczer 1992] Menczer,F. and Parisi, D. Evidence of hyperplanes in the genetic learning of neural networks.
Biological Cybernetics 66(3):283-289.

[Miller 1989] G. Miller, P. Todd, and S. Hegde. Designing neural networks using genetic algorithms. In
Proceedings of the Third Conference on Genetic Algorithms and their Applications, San Mateo, CA. Morgan
Kaufmann, 1989.

[Muhlenbein 1988] H. Muhlenbein, M. Gorges-Schleuter, and O. Kramer. Evolution Algorithms in Combinatorial
Optimisation. Parallel Computing, 7, pp. 65-85.

[Price 1970] G. R. Price. Selection and covariance. Nature, 227:520-521.

[Price 1972] G. R. Price. Extension of covariance mathematics. Annals of Human Genetics 35:485-489.

[Salmon 1971] W. C. Salmon. Statistical Explanation and Statistical Relevance. University of Pittsburg Press,
Pittsburgh, 1971.

[Schaffer 1984] J. D. Schaffer. Some Experiments in Machine Learning Using Vector Evaluated Genetic
Algorithms. Ph.D. Thesis, Department of Electrical Engineering, Vanderbilt University, December 1984.

[Schwefel 1965] H-P Schwefel. Kybernetische Evolution als Strategie experimentellen Forschung in der
Strömungstechnik. Diploma thesis, Technical University of Berlin, 1965.

[Slatkin 1970] M. Slatkin. Selection and polygenic characters. Proceedings of the National Academy of Sciences
U.S.A. 66:87-93. 1970.

[Smith 1980] S. F. Smith. A Learning System Based on Genetic Adaptive Algorithms. Ph.D. Dissertation,
University of Pittsburg.

[Spiessens 1991] P. Spiessens and B. Manderick. A massively parallel genetic algorithm - implementation and
first analysis. Proceedings of the fourth international conference on genetic algorithms. Morgan Kaufmann 1991.

[Valenzuela 1994] Christine L. Valenzuela and Antonia J. Jones. Evolutionary Divide and Conquer (I): A novel
genetic approach to the TSP. Evolutionary Computation 1(4):313-333, 1994.

[Valenzuela 1995] Christine L. Valenzuela. Evolutionary Divide and Conquer: A novel genetic approach to the
TSP. Ph.D. Thesis, Department of Computing, Imperial College, London. 1995

[Valenzuela 1997] Christine L. Valenzuela and Antonia J. Jones. Estimating the Held-Karp lower bound for the
geometric TSP. To appear: European Journal of Operational Research, 1997.

[Whitley 1990] D. Whitley, T. Starkweather, and C. Bogart. Genetic algorithms and neural networks: Optimizing
connections and connectivity. Parallel Computing, forthcoming.



[Wilson 1990] S. W. Wilson. Perceptron redux. Physica D, forthcoming.








                                                       III Hopfield networks.



Introduction.

As far back as 1954 Cragg and Temperley [Cragg 1954] had introduced the spin-neuron analogy using a
ferromagnetic model. They remarked

         "It remains to be considered whether biological analogues can be found for the concepts of temperature and interaction
         energy in a physical system."

In 1974 Little [Little 1974] introduced the temperature-noise analogy, but it was not until 1982 that the physicist
John Hopfield [Hopfield 1982] made significant progress in the direction indicated by Cragg and Temperley. In a
single short paragraph he proposed one of the most important new techniques to have appeared in neural
networks.

Hopfield nets and energy.

The standard approach to a neural network is to propose a learning rule, usually based on synaptic modification,
and then to show that a number of interesting effects arise from it. Hopfield starts by saying that:

         "The function of the nervous system is to develop a number of locally stable states in state space."

Other points in state space flow into the stable points, called attractors. In some other dynamical systems the
behaviour is much more complex: for example, the system may orbit two or more points in state space in a non-
periodic way, see [Abraham 1985]. However, this turns out not to be the case for the Hopfield net.

The flow of the system towards a stable point allows a mechanism for correcting errors, since deviations from the
stable points disappear. The system can thus reconstruct missing information since the stable point will
appropriately complete missing parts of an incomplete initial state vector.
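This error-correcting behaviour is easy to demonstrate. The sketch below is an illustration only: it stores a single 0/1 pattern using an outer-product (Hebbian) weight choice with zero thresholds — an assumed prescription at this point, since the text has not yet fixed a weight rule — and shows that a corrupted copy of the pattern flows back to the stored state under the asynchronous threshold updates described below in this section.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 16

# A single stored pattern of 0/1 activities (half the units firing).
pattern = np.zeros(n)
pattern[:8] = 1.0
rng.shuffle(pattern)

# Illustrative outer-product (Hebbian) weights built from the +/-1 version
# of the pattern, with zero diagonal and zero thresholds (assumptions made
# purely for this demonstration).
s = 2 * pattern - 1
W = np.outer(s, s)
np.fill_diagonal(W, 0.0)
theta = np.zeros(n)

# Corrupt three bits of the stored pattern.
x = pattern.copy()
flip = rng.choice(n, size=3, replace=False)
x[flip] = 1.0 - x[flip]

# Asynchronous threshold updates: each sweep visits every neuron once,
# in random order.
for _ in range(5):
    for i in rng.permutation(n):
        h = W[i] @ x          # sum over j != i of w_ij x_j (w_ii = 0)
        if h > theta[i]:
            x[i] = 1.0
        elif h < theta[i]:
            x[i] = 0.0
```

With a single stored pattern every update can be checked by hand to move the state towards it, so after a few sweeps the corrupted state coincides with the stored pattern: the attractor has completed the missing information.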

Each of n neurons has two states, like those of McCulloch and Pitts, xi = 0 (not firing) and xi = 1 (firing at
maximum rate). When neuron i has a connection made to it from neuron j, the strength of the connection is defined
as wij. Non-connected neurons have wij = 0. The instantaneous state of the system is specified by listing the n values
of xi, so is represented by a binary word of n bits. The state changes in time according to the following algorithm.
For each neuron i there is a fixed threshold θi. Each neuron readjusts its state randomly in time but with a mean
attempt rate µ, setting

                             xi(t) = 1          if  Σj≠i wij xj(t−1)  >  θi
                             xi(t) = xi(t−1)    if  Σj≠i wij xj(t−1)  =  θi                               (1)
                             xi(t) = 0          if  Σj≠i wij xj(t−1)  <  θi



Thus, an element chosen at random asynchronously evaluates whether it is above or below threshold and readjusts
accordingly.
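Rule (1) is straightforward to simulate. The sketch below is illustrative only: random symmetric weights with zero diagonal and zero thresholds are assumptions, not a prescription from the text. It applies the asynchronous rule and records the standard Hopfield energy for 0/1 units, which never increases under these dynamics — the mechanism behind the flow towards stable points.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 12

# Assumed illustrative parameters: symmetric weights, no self-coupling,
# zero thresholds.
W = rng.normal(size=(n, n))
W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)
theta = np.zeros(n)

x = rng.integers(0, 2, size=n).astype(float)   # binary state, x_i in {0, 1}

def energy(x):
    # Standard Hopfield energy for 0/1 units with symmetric W, w_ii = 0.
    return -0.5 * x @ W @ x + theta @ x

energies = [energy(x)]
for _ in range(200):
    i = rng.integers(n)        # neuron chosen at random (asynchronous update)
    h = W[i] @ x               # sum over j != i of w_ij x_j (w_ii = 0)
    if h > theta[i]:
        x[i] = 1.0
    elif h < theta[i]:
        x[i] = 0.0
    # if h equals theta[i], the state is left unchanged, as in (1)
    energies.append(energy(x))
```

Each update changes the energy by −Δxi(hi − θi) ≤ 0, so the recorded sequence is non-increasing and the state settles into a locally stable configuration.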

Although this model has superficial similarities to the Perceptron, there are essential differences. Firstly,
Perceptrons were modelled chiefly with the neural connections in a `forward' direction, and the analysis of such
networks with backward coupling proved intractable. All the interesting results of the Hopfield model arise as a
consequence of the strong backward coupling. Secondly, studies of perceptrons usually made a random net of


. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 Algorithm 7-2 Generic Hopfield net. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 6
I What is evolutionary computing?

"All of this will lead to theories which are much less rigidly of an all-or-none nature than past and present formal logic. They will be of a much less combinatorial, and much more analytical, character. In fact, there are numerous indications to make us believe that this new system of formal logic will move closer to another discipline which has been little linked in the past with logic. This is thermodynamics, primarily in the form it was received from Boltzmann, and it is that part of theoretical physics which comes nearest in some of its aspects to manipulating and measuring information." [von Neumann, Collected Works, Vol. 5, p. 304]

Introduction.

Evolutionary computing embraces models of computation inspired by living Nature. For example, evolution of species by means of natural selection and the genetic operators of mutation, sexual reproduction and inversion can be considered as a parallel search process. Perhaps we can tackle hard combinatoric search problems in computer science by mimicking (in a very stylised form) the natural process of evolutionary search.

Evolution through natural selection drives the adaptation of whole species, but individual members of a species can also adapt to a greater or lesser extent. The adaptation of individual behaviour on the basis of experience is learning, and it stems from the plasticity of the neural structures which convey and process information in animals. Learning enables us to recognise previously encountered stimuli or experiences and modify our behaviour accordingly. It facilitates prediction and control of the environment, both essential prerequisites to planning. All of these are facets of what we loosely call intelligence.

Real-world intelligence is essentially a computational process. This is a contentious assertion known as "the strong AI position".
If it is true then the precise mechanism of computation (the hardware or neural wetware) ought to be irrelevant to the actual principles of the computational process. If this is indeed the case then the only obstacles to the construction of a truly intelligent artifact are our own understanding of the computational processes involved and our technical capability to construct suitable and sufficiently powerful computational devices.

A general framework for neural models.

Throughout this course we describe a number of neural models, each a variation on the connectionist paradigm (often called Parallel Distributed Processing - PDP), which in turn is derived from networks of highly stylised versions of the biological neuron. It is useful to begin with an analysis of the various components of these models. There are seven major aspects of a connectionist model:

! A set of processing units ui, each producing a scalar output xi(t) (1 ≤ i ≤ n).

! A connectivity graph which determines the pattern of connections (links) from each unit to each of the other units in the network. We shall often suppose that each unit has n inputs, but there is no particular reason why all units should have the same number of inputs.
Although it is often convenient for theoretical discussions to consider fully interconnected networks, for very large networks of either real or artificial neurons the relevant case is that of relatively sparse connectivity. The connectivity graph then describes the fine topology of the network. This can be useful in practical applications; for example, in speech recognition networks it is often helpful to have several copies of the same sub-net connected to temporally distinct inputs. These sub-net copies act as a feature detector and so can share their weights - this effectively reduces the number of parameters needed to describe the full network and speeds up learning. It is sufficient to be given a list of inputs and outputs for each node, for we can then recover the connectivity graph.

! A set of parameters pi1,...,pik, fixed in number, attached to each unit ui, which are adjusted during learning. Most commonly k = n and the parameters are weights wij (1 ≤ j ≤ n), where wij is often taken to be associated with the link from j to i, or in biological terms associated with the synaptic gap.

! An activation function for each unit, neti = neti(x1,...,xn;pi1,...,pik), which combines the inputs to ui into a scalar value. In the commonly used model neti = Σj wij xj.

It is important to realise that the basic principle of neural networks is that of simple (but arbitrary) computational function at each node. Learning, when it occurs, can be considered as an adjustment of the parameters associated with a node based on information locally available to the node. ‘Locally’ here means as specified by the connectivity graph. This information often takes the form of a correlation between the firings of adjacent nodes, but it could be a more sophisticated calculation. Thus we are really dealing with a very general class of parallel algorithms.
The concentration on the ‘weights associated with links’ model has arisen partly because of the biological precedent, because of the extreme simplicity of the computational function of a node, and because this special case has been shown to be of practical interest.

[Figure 1-1 The stylised version of a standard connectionist neuron: inputs x1(t), x2(t), ..., xn(t) are combined by the activation function neti = neti(x1,...,xn, pi1,...,pik); a sigmoidal output function then gives the output xi = f(neti), which appears on the output links as xi(t+1).]

! An output function xi = f(neti) which transforms the activation function into an output. In the earliest models f was a discontinuous step function. However, this poses analytical difficulties for learning algorithms, so that often now f is a smooth sigmoidal shaped function. In some models f is allowed to vary
from one unit to another and so then we write fi for f.

! A learning rule whereby the parameters associated with each processing unit are modified by experience.

! An environment within which the system must operate.

A set of processing units.

Figure 1-1 illustrates a standard connectionist component. All of the processing of a connectionist system is carried out by these units. There is no executive or overseer. There are only relatively simple units, each doing its own relatively simple job. A unit's job is simply to receive input from other units and, as a function of the input it receives and the current values of its internal parameters, to compute an output value xi which it sends to the other units. This output is discrete in some models and continuous in others. When the output is continuous it is often confined to [0,1] or [-1,1]. The system is inherently parallel in that many units carry out their computations at the same time.

Within any system we are modelling, it is sometimes useful to characterize three types of units: input, output, and hidden. The hidden units are those whose inputs and outputs are within the system we are modelling. They are not ‘visible’ to outside systems.

A connectivity graph.

Each unit passes its output to other units along links. The graph of links represents the connectivity of the network.

A set of parameters and an activation function.

In the conventional model the parameters for unit i are assumed to be weights wij associated with the link from unit j to unit i. If wij > 0 the link is said to be an excitatory link, if wij = 0 unit j is effectively not connected to unit i, and if wij < 0 the link is said to be an inhibitory link. In this case neti is calculated as

    neti = Σj=1..n wij xj    (1)

This is a linear function of the inputs and so neti is constant over hyperplanes in the n-dimensional space of inputs to unit i.
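A single unit of this conventional kind is easy to sketch in code. The fragment below is an illustrative Python translation (the course demonstrators are Mathematica notebooks); the logistic function is used as a typical sigmoid, and all names here are invented for the example:

```python
import math

def sigmoid(net):
    """A typical smooth sigmoidal output function, mapping any net
    activation into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))

def unit_output(inputs, weights):
    """One conventional connectionist unit: the activation is the
    weighted sum net_i = sum_j w_ij * x_j of equation (1), and the
    output is x_i = f(net_i)."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(net)

# With net = 0 the unit sits at the midpoint of the sigmoid.
print(unit_output([1.0, -1.0], [0.5, 0.5]))  # prints 0.5
```

A network is then just many such units evaluated in parallel, each reading the outputs the others produced at the previous time step.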
In fact, if one is interested in generalising the computational function of a unit, it is often convenient to associate the parameters (in the conventional case weights) with the unit. In which case one thinks of the links as passing activation values and one is no longer constrained to have exactly n (the number of inputs) parameters per unit. For example, one could have a unit which performed its distinction function by determining whether or not the input vector lay within some ellipsoid. In this case there would be n parameters associated with the centre of the ellipsoid and another n parameters associated with the axes. (In addition one could provide the ellipsoid with rotations which would provide further parameters.) Now the activation function would look like

    neti = Σj=1..n Aij (xj - cij)²    (2)

This is a simple example of a higher order network in which the function neti is not a linear function of the inputs.

An output function.

The simplest possible output function f would be the identity function, i.e. just take xi = neti. However, in this case with the activation function (1) the unit would be performing a totally linear function on the inputs and, as it turns out, such nets are rather uninteresting. In any event our unit is not yet making a distinction. In the discrete model the output function is usually
    xi = 1 if neti > θi
    xi = 0 if neti ≤ θi    (3)

where θi is the threshold, a parameter associated with the unit. However, this creates discontinuities of the derivatives and so we usually smooth the output function and write

    xi = f(neti)    (4)

In the linear case f is some sort of sigmoidal function. For our ellipsoidal example Gaussian smoothing might be suitable, i.e. f(x) = exp(-x²), so that the output is large (near one) when the input vector is near the centre of the ellipsoid. Sometimes the output function is stochastic so that the output of the unit depends in a probabilistic fashion on neti.

For an individual unit the sequence of events in operational mode (not learning) is

1. Combine inputs to produce activation neti(t).
2. Compute value of output xi = f(neti).
3. Place outputs, based on new activation level, on output links (available from t+1 onward).

Changing the processing or knowledge structure in a connectionist model involves modifying the patterns of interconnections or parameters associated with each unit. This is accomplished by modifying pi1,...,pik (or the wij in the usual model) through experience using a learning rule. Virtually all learning rules are based on some variant of a Hebbian principle (discussed in the next section) which is invariably derived mathematically through some form of gradient descent. For example, the Delta or Widrow-Hoff rule. Here modification of weights is proportional to the difference between the actual activation achieved and the target activation provided by a teacher

    Δwij = η (ti(t) - neti(t)) xj(t),

where η > 0 is constant. This is a generalization of the Perceptron learning rule and is all very well provided we know the desired values ti(t).

Hebbian learning.

Donald O.
Hebb's book The Organization of Behavior (1949) is famous among neural modelers because it contained the first explicit statement of the physiological learning rule for synaptic modification that has since become known as the Hebb synapse: Hebb rule. When an axon of a cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased. The physiological basis for this synaptic potentiation is now understood more clearly [Brown 1988]. Hebb's introduction to the book also contains the first use of the word 'connectionism' in the context of neural modeling. The Hebb rule is not a mathematical statement, though it is close to one. For example, Hebb does not discuss the various possible ways inhibition might enter the picture, or the quantitative learning rule that is being followed. This has meant that a number of quite different learning rules can legitimately be called 'Hebbian rules'. We shall see later that nearly all such learning rules bear a close mathematical relationship to the idea of `gradient descent', which roughly means that if we wish to move to the lowest point of some error surface a good heuristic is: we should always tend to go `downhill'. However, for the present chapter we shall conceptualise the Hebb rule in terms of autocorrelations, i.e. the internal correlations between each pair of components of the pattern vectors we wish 10
the system to memorise. Hebb was keenly aware of the `distributed' nature of the representation he assumed the nervous system uses; that to represent something assemblies of many cells are required and that an individual cell may be a participant member of many representations at different times. He postulated the formation of cell assemblies representing learned patterns of activity.

The need for machine learning.

Why do we need to discover how to get machines to learn? After all, is it not the case that the most practical developments in Artificial Intelligence, such as Expert Systems, have emerged from the development of advanced symbolic programming languages such as LISP or Prolog? Indeed, this is so. But there are convincing arguments [Bock 1985] which suggest that the technique of simulating human skills using symbolic programs cannot hope, in the long run, to satisfy the principal goals of AI. Mainly these centre around the time it would take to figure out the rules and write the software. But first we should consider the evolution of hardware.

How can one measure the overall computational power of an information processing system? There are two obvious aspects we should consider. Firstly, information storage capacity - a system cannot be very smart if it has little or no memory. On the other hand, a system may have a vast memory but little or no capacity to manipulate information; so a second essential measure is the number of binary operations per second. On these two scales Figure 1-2 illustrates the information processing capability of some familiar biological and technological information processing systems. In the case of the biological systems these estimates are based on connectionist models and may be excessively conservative. We consider each axis independently.
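Before examining the hardware trends, the two learning rules met above (the Widrow-Hoff Delta rule and a simple quantitative reading of the Hebb rule) can be sketched as follows. This is an illustrative Python fragment, not from the course materials, and all names are invented here:

```python
def delta_rule_step(weights, x, target, eta=0.1):
    """One Widrow-Hoff (Delta rule) update for a linear unit:
    dw_j = eta * (t - net) * x_j, where net = sum_j w_j * x_j."""
    net = sum(w * xj for w, xj in zip(weights, x))
    return [w + eta * (target - net) * xj for w, xj in zip(weights, x)]

def hebb_step(weights, x_pre, x_post, eta=0.1):
    """One simple Hebbian update: dw_j = eta * x_post * x_pre_j,
    strengthening a connection when the activities it joins are
    correlated."""
    return [w + eta * x_post * xj for w, xj in zip(weights, x_pre)]

# Repeated Delta-rule steps drive the unit's activation to the target.
w = [0.0, 0.0]
for _ in range(100):
    w = delta_rule_step(w, [1.0, 1.0], target=1.0, eta=0.2)
net = sum(wi * xi for wi, xi in zip(w, [1.0, 1.0]))
print(round(net, 6))  # prints 1.0
```

Note how the Delta rule is a gradient descent on the squared error (t - net)², whereas the Hebb step needs no teacher at all, only locally observable activity.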
As we saw earlier, research in neurophysiology has revealed that the brain and central nervous system consists of about 10^11 individual parallel processors, called neurons. Each neuron has roughly 10^4 synaptic connections and if we allow only 1 bit per synapse then each neuron is capable of storing about 10^4 bits of information. The information capacity of the brain is thus about 10^15 bits. Much of this information is probably redundant but using this figure as a conservative estimate let us consider when we might expect to have high-speed memories of 10^15 bits.
[Figure 1-2 Information processing capability. From: Mind Children, Hans Moravec.]

Figure 1-3 shows that the amount of high-speed random access memory that may be conventionally accessed by a large computer has increased by an order of magnitude every six years. If we can trust this simple extrapolation, in generation thirteen, AD 2024-30, the average high speed memory capacity of a large computer will reach 10^15 bits.

Now consider the evolution of technological processing power. Remarkably, this follows much the same trend. Of course, the real trick is putting the two together to achieve the desired result; it seems relatively unlikely that we shall be in a position to accomplish this by 2024. So much for the hardware. Now consider the software.

Even adult human brains are not filled to capacity. So we will assume that 10% of the total capacity, i.e. 10^14 bits, is the extent of the `software' base of an adult human brain.

[Figure 1-3 Storage capacity.]

How long will it take to write the
programs to fill 10^14 bits (production rules, knowledge bases etc.)? The currently accepted rate of production of software, from conception through testing, de-bugging and documentation to installation, is about one line of code per hour. Assuming, generously, that an average line of code contains approximately 60 characters, or 500 bits, we discover that the project will require 100 million person years! We'll never get anywhere by trying to program human intelligence into a machine. What other options are available?

One is direct transfer from the human brain to the machine. Considering conventional transfer rates over a high speed bus this would take about 12 days. The only problem is: nobody has the slightest idea how to build such a device. What's left?

In the biological world intelligence is acquired every day, therefore there must be another alternative. Every day babies are born and in the course of time acquire a full spectrum of intelligence. How do they do it? The answer, of course, is that they learn. If we assume that the eyes, our major source of sensory input, receive information at the rate of about 250,000 bits per second, we can fill the 10^14 bits of our machine's memory capacity in about 20 years. Now storing sensory input is not the same thing as developing intelligence, however this figure is in the right ball park. Maybe what we must do is connect our machine brain to a large number of high-data-rate sensors, endow it with a comparatively simple algorithm for self organization, provide it with a continuous and varied stream of stimuli and evaluations for its responses, and let it learn.

This argument may seem cavalier in some aspects. The human brain is highly parallel and somewhat inhomogeneous in its architecture. It does not clock at high serial speeds and does not access RAM to recall information for processing.
The storage capacity may be vastly greater than the 10^15 bits estimated by Sagan, since each neuron is connected to as many as 10,000 others and the structure of these interconnections may also store information. Indeed, although we do not know a great deal about the mechanisms of human memory, we do know that it is multi-levelled with partial bio-chemical storage. However, none of this invalidates Bock's point that programming can never be a substitute for learning.
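The back-of-envelope estimates in this chapter are easy to reproduce. The sketch below (Python, illustrative only; the 2000-hour working year is my assumption, not a figure from the text) checks the orders of magnitude:

```python
# Order-of-magnitude checks for the estimates quoted in the text.
neurons = 10**11                 # parallel processors in the brain
bits_per_neuron = 10**4          # ~10^4 synapses at 1 bit each
brain_bits = neurons * bits_per_neuron
print(brain_bits)                            # 10^15 bits

software_bits = brain_bits // 10             # assume 10% 'filled': 10^14 bits
lines = software_bits // 500                 # ~500 bits per line of code
person_years = lines / 2000                  # one line/hour, ~2000 h/person-year
print(f"{person_years:.0e} person-years")    # ~1e8, i.e. 100 million

seconds = software_bits / 250_000            # eyes: ~250,000 bits per second
years = seconds / (3600 * 24 * 365)
print(f"about {years:.0f} years of continuous sensory input")
```

The sensory figure comes out at roughly a decade of round-the-clock input, comfortably within the "right ball park" of the text's 20-year estimate once waking hours and overheads are allowed for.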
II Genetic Algorithms

Introduction.

The idea that the process of evolutionary search might be used as a model for hard combinatoric search algorithms developed significantly in the mid-1960s. Evolutionary algorithms fall into the class of probabilistic heuristic algorithms which one might use to attack NP-complete or NP-hard problems (see, for example, [Horowitz 1978], Chapters 11 and 12), such as the Travelling Salesman/person Problem (TSP). Of course, many of these problems have significant applications in engineering hardware or software design and commercial optimisation problems, but the underlying motivation for the study of evolutionary algorithms is principally to try to gain insight into the evolutionary process itself.

Variously known as genetic algorithms, the phrase coined by the US school stemming from the work of John Holland [Holland 1975], evolutionary programming, originally developed by L. J. Fogel, A. J. Owens and M. J. Walsh, again in the US, and Evolutionsstrategie, as studied in Germany at around the same time by I. Rechenberg and H-P. Schwefel [Schwefel 1965], the subject has exploded over the last 15 years. Curiously, the European and US schools seemed largely unaware of each other's existence for quite some while.

Evolutionary algorithms have been applied to a variety of problems and offer intriguing possibilities for general purpose adaptive search algorithms in artificial intelligence, especially, but not necessarily, for situations where it is difficult or impossible to precisely model the external circumstances faced by the program.

Search based on evolutionary models had, of course, been tried before Holland's introduction of genetic algorithms. However, these models were based on mutation and not notably successful.
The principal difference of the more modern research is an emphasis on the power of natural selection and the incorporation of a ‘crossover’ operator to mimic the effect of sexual reproduction. Two rather different types of theoretical analysis have developed for evolutionary algorithms: the classical approach stemming from the original work of Mendel on heritability and the later statistical work of Galton and Pearson at the end of the last century, and the Schema theory approach developed by Holland. Mendel constructed a chance model of heritability involving what are now called genes. He conjectured the existence of genes by pure reasoning - he never saw any. Galton and Pearson found striking statistical regularities in heritability in large populations, for example, on average a son is halfway between his father's height and the overall average height for sons. They also invented many of the statistical tools in use today such as the scatter diagram, regression and correlation (see, for example, [Freedman 1991]). Around 1920 Fisher, Wright and Haldane more or less simultaneously recognised the need to recast Darwinian theory as described by Galton and Pearson in Mendelian terms. They succeeded in this task, and more recently Price's Covariance and Selection Theorem [Price 1970], [Price 1972], an elaboration of these ideas, has provided a useful tool for algorithm analysis. The archetypal GA. In Nature each gene has several forms or alternatives - alleles - producing differences in the set of characteristics associated with that gene, e.g. certain strains of garden pea have a single gene which determines blossom colour, one allele causing the blossom to be white, the other pink. There are tens of thousands of genes in the chromosomes of a typical vertebrate, each of which, on the available evidence, has several alleles. 
Hence the set of chromosomes attained by taking all possible combinations of alleles contains on the order of 10^3000 structures for a typical vertebrate species. Even a very large population, say 10 billion individuals, contains only a minuscule fraction of the possibilities.
A further complication is that alleles interact so that adaptation becomes primarily the search for co-adapted sets of alleles. In the environment against which the organism is tested any individual exemplifies a large number of possible `patterns of co-adapted alleles' or schema, as Holland calls them. In testing this individual we shall see that all schema of which the individual is an instantiation are also tested. If the rules whereby genes are combined have a tendency to generate new instances of above average schema then the resulting adaptive system has a high degree of `intrinsic parallelism'1 which accelerates the evolutionary process. Considerations of this type offer an explanation of how evolution can proceed at all. If a simple enumerative plan were employed and if 10^12 structures could be tried every second it would take a time vastly exceeding the estimated age of the universe to test 10^100 structures.

The basic idea of an evolutionary algorithm is illustrated in Figure 2-1.

[Figure 2-1 Generic model for a genetic algorithm. INITIALISE: create an initial population and evaluate the fitness of each member. Then repeat: create children from the existing population using genetic operators; evaluate the fitness of the children (an external step); substitute the children into the population, deleting an equivalent number.]

We seek to optimise members of a population of ‘structures’. These structures are encoded in some manner by a ‘gene string’. The population is then ‘evolved’ in a very stylised version of the evolutionary process. We are given a set, A, of `structures' which we can think of, in the first instance, as being a set of strings of fixed length l. The object of the adaptive search is to find a structure which performs well in terms of a measure of performance v : A → ℝ+, where ℝ+ denotes the positive real numbers.
1 The notion of 'intrinsic parallelism' will be discussed but it should be mentioned that it has nothing to do with parallelism in the sense normally intended in computing.
The programmer must provide a representation for the structures to be optimised. In the terminology of genetic algorithms a particular structure is called a phenotype and its representation as a string is called a chromosome or genotype. Usually this representation consists of a fixed length string in which each component, or gene, may take only a small range of values, or alleles. In this context `small' often means two, so that binary strings are used for the genotypes. There is nothing obligatory in taking a one-bit range for each allele but there are theoretical reasons to prefer few-alleles-at-many-sites over many-alleles-at-few-sites (the arguments have been given by [Holland 1975] (p. 71) and [Smith 1980] (p. 56), and supporting evidence for the correctness of these arguments has been presented by [Schaffer 1984] (p. 107)).

1. Randomly generate a population of M structures S(0) = {s(1,0),...,s(M,0)}.
2. For each new string s(i,t) in S(t), compute and save its measure of utility v(s(i,t)).
3. For each s(i,t) in S(t) compute the selection probability defined by p(i,t) = v(s(i,t)) / Σi v(s(i,t)).
4. Generate a new population S(t+1) by selecting structures from S(t) via the selection probability distribution and applying the idealised genetic operators to the structures generated.
5. Goto 2.

Algorithm 2-1 Archetypal genetic algorithm.

The function v provides a measure of ‘fitness’ for a given phenotype and (since the programmer must also supply a mapping from the set of genotypes to the set of phenotypes) hence for a given genotype. Given a particular genotype or string, the goal function provides a means for calculating the probability that the string will be selected to contribute to the next generation. It should be noted that the composite function mapping genotypes (via phenotypes) to fitness is invariably discontinuous; nevertheless genetic algorithms cope remarkably well with this difficulty.
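Steps 3 and 4 of Algorithm 2-1 amount to what is usually called roulette-wheel selection. A minimal Python sketch (illustrative only; all names are invented here):

```python
import random

def selection_probabilities(fitnesses):
    """Step 3 of Algorithm 2-1: p(i,t) = v(s(i,t)) / sum_i v(s(i,t))."""
    total = sum(fitnesses)
    return [v / total for v in fitnesses]

def select(population, fitnesses, rng=random):
    """Roulette-wheel selection (step 4): each structure is drawn with
    probability proportional to its fitness, so fit strings are favoured
    but every member retains some chance to contribute."""
    probs = selection_probabilities(fitnesses)
    r = rng.random()
    cumulative = 0.0
    for individual, p in zip(population, probs):
        cumulative += p
        if r <= cumulative:
            return individual
    return population[-1]  # guard against floating-point round-off

print(selection_probabilities([1.0, 3.0]))  # prints [0.25, 0.75]
```

Note that this scheme requires v to be positive; fitness functions taking negative values must first be rescaled.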
The basis of Darwinian evolution is the idea of natural selection, i.e. population genetics tends to use the Selection Principle: the fitness of an individual is proportional to the probability that it will reproduce effectively.2 In genetic algorithm design we tend to apply this in the converse form: the probability that an individual will reproduce is proportional to its fitness. ‘Fit’ strings, i.e. strings having larger goal function values, will be more likely to be selected, but all members of the population will have some chance to contribute.

2 Obfuscation of the definition of ‘fitness’ occurs frequently in the classical literature. The reasons are not difficult to understand. Both Darwin and Fisher found it hard to swallow that the lower classes bred more prolifically and were therefore, by definition, ‘fitter’ than their ‘social superiors’. This confusion regarding ‘fitness’ still occurs in the GA literature for different reasons.
The box contains a sketch of the standard serial style genetic algorithm. Typically the evaluation of the goal function for a particular phenotype, a process which strictly speaking is external to the genetic algorithm itself, is the most time consuming aspect of the computation. Given the mapping from genotype to phenotype, the goal function, and an initial random population, the genetic algorithm proceeds to create new members of the population (which progressively replace the old members) using genetic operators, typically mutation, crossover and inversion, modelled on their biological analogs.

For the moment we represent strings as a1a2a3...al [ai = 1 or 0]. Using this notation we can describe the operators by which strings are combined to produce new strings. It is the choice of these operators which produces a search strategy that exploits co-adapted sets of structural components already discovered. Holland uses three such principal operators: Crossover, Mutation and Inversion (which we shall not discuss in detail here).

[Figure 2-2 Standard genetic operators.
CROSSOVER (two cut points):
  Parent 1  1011 010011 10111     Child 1  1100 010011 11010
  Parent 2  1100 111000 11010     Child 2  1011 111000 10111
MUTATION:   110011100011010 -> 111011101011010
INVERSION:  111111100011010 -> 110011111011010]

Crossover. In crossover one or more cut points are selected at random and the operation illustrated in Figure 2-2 and Figure 7-1 (where two cut points are employed) is used to create two children. A variety of control regimes are possible, but a simple strategy might be `select one of the children at random to go into the next generation'. Children tend to be `like' their parents, so that crossover can be considered as a focussing operator which exploits knowledge already gained; its effects are quite quickly apparent.

Crossing over proceeds in three steps.

a) Two structures a1...al and b1...bl are selected at random from the current population.
b) A crossover point x, in the range 1 to l-1, is selected, again at random.

c) Two new structures

   a1a2...ax bx+1bx+2...bl
   b1b2...bx ax+1ax+2...al

are formed.

In modifying the pool of schema (discussed below), crossing over continually introduces new schema for trial whilst testing extant schema in new contexts. It can be shown that each crossing over affects a great number of schema.
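Steps (a)-(c) can be sketched in a few lines of Python (illustrative only; the function name and parameters are my own):

```python
import random

def one_point_crossover(a, b, x=None, rng=random):
    """Steps (a)-(c) above: choose a cut point x in the range 1 to l-1
    and exchange tails, forming a1..ax b(x+1)..bl and b1..bx a(x+1)..al."""
    assert len(a) == len(b)
    if x is None:
        x = rng.randint(1, len(a) - 1)  # step (b): random cut point
    return a[:x] + b[x:], b[:x] + a[x:]

# The parent strings of Figure 2-2, crossed at a single point x = 4.
c1, c2 = one_point_crossover("101101001110111", "110011100011010", x=4)
print(c1, c2)  # prints 101111100011010 110001001110111
```

Observe that at every position the two children between them carry exactly the two parental alleles, which is why crossover explores new combinations without inventing new allele values.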
There is large variation in the crossover operators which have been used by different experimenters. For example, it is possible to cross at more than one point. The extreme case of this is where each allele is randomly selected from one or other parent string with uniform probability - this is called uniform crossover. Although some writers have argued in favour of uniform crossover, there would seem to be theoretical arguments against its use, viz. if evolution is the search for co-adapted sets of alleles then this search is likely to be severely undermined if many cut points are used. In language we shall develop shortly: the probability of schema disruption when using uniform crossover is much higher than when using one or two point crossover.

The design of the crossover operator is strongly influenced by the nature of the representation. For example, if the problem is the TSP and the representation of a tour is a straightforward list of cities in the order in which they are to be visited, then a simple crossover operator will, in general, not produce a tour. In this case the options are:

! Change the representation.
! Modify the crossover operator.
! Effect ‘genetic repair’ on non-tours which may result.

There is obviously much scope for experiment for any particular problem. The danger is that the resulting algorithm may be so far removed from the canonical form that the correlation between parental and child fitness may be small - in which case the whole justification for the method will have been lost.

Mutation. In mutation an allele is altered at each site with some fixed probability. Mutation disperses the population throughout the search space and so might be considered as an information gathering or exploration operator. Search by mutation is a slow process analogous to exhaustive search.
Thus mutation is a `background' operator, assuring that the crossover operator has a full range of alleles so that the adaptive plan is not trapped on local optima. Each structure a1a2...al in the population is operated upon as follows. Position x is modified, with probability p independent of the other positions, so that the string is replaced by

    a1a2...ax-1 z ax+1...al

where z is drawn at random from the possible values. If p is the probability of mutation at a single position, then the number of mutations in a given string of length l is binomially distributed and, for small p, is well approximated by a Poisson distribution with parameter lp.

A simple demonstrator is given in the Mathematica program GA_Simple.nb. A more complicated GA using inversion is given in GA_Inversion.nb.

Design issues - what do you want the algorithm to do?

Now we have to ask just what it is we want of a genetic algorithm. There are several, sometimes mutually exclusive, possibilities. For example:

! Rapid convergence to a global optimum.
! Produce a diverse population of near optimal solutions in different `niches'.
! Be adaptive in `real-time' to changes in the goal function.

We shall deal with each of these in turn, but first let us briefly consider the nature of the search space. If the space is flat with just one spike then no algorithm short of exhaustive search will suffice. If the space is smooth and unimodal then a conventional hill-climbing technique should be used.
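The per-site mutation operator described above can be sketched minimally in Python (the function name and binary default alphabet are our assumptions):

```python
import random

def mutate(string, p, alphabet="01"):
    """Per-site mutation: with probability p, independently at each
    position, the allele is replaced by a value z drawn at random
    from the possible values (which may equal the original)."""
    return "".join(
        random.choice(alphabet) if random.random() < p else allele
        for allele in string
    )

# With p = 0 the string is returned unchanged; in practice a GA uses
# a small p, such as 1/l for strings of length l.
```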
Somewhere between these two extremes are problems in which the goal function is a highly non-linear multi-modal function of the gene values - these are the problems of hard combinatoric search for which some style of genetic algorithm may be appropriate.

Rapid convergence to a global optimum.

Of course this is rather simplistic. Holland's theory holds for large populations. However, in many AI applications it is computationally infeasible to use large populations, and this in turn leads to a problem commonly referred to in the literature of genetic algorithms as Premature Convergence (to a sub-optimal solution) or Loss of Diversity. When this occurs the population tends to become dominated by one relatively good solution and locked into a sub-optimal region of the search space. For small populations the schema theorem is actually an explanation for premature convergence (i.e. the failure of the algorithm) rather than a result which explains success.

Premature convergence is related to a phenomenon observed in Nature. Allelic frequencies may fluctuate purely by chance about their mean from one generation to another; this is termed Random Genetic Drift. Its effect on the gene pool in a large population is negligible, but in a small effectively interbreeding population, chance alteration in Mendelian ratios can have a significant effect on gene frequencies and can lead to the fixation of one allele and loss of another. For example, isolated communities within a given population have been found to have frequencies for blood group alleles different from the population as a whole. Figure 2-3 illustrates this phenomenon with a simple function optimisation genetic algorithm.

The inexperienced often tend to attempt to counteract premature convergence by increasing the rate of mutation. However, this is not a good idea.
A high rate of mutation tends to devalue the role of crossover in building co-adapted sets of alleles and in essence pushes the algorithm in the direction of exhaustive search. Whilst some mutation is necessary, a high rate of mutation is invariably counter-productive.

In trying to counteract premature convergence we are essentially trying to balance the exploitation of good solutions found so far against the exploration which is required to find hitherto unknown promising regions of the search space. It is worth observing that, in computational terms, any algorithm which often inserts copies of strings into the current population is wasteful. This is true for the Traditional Genetic Algorithm (TGA) outlined earlier.

Figure 2-3 Premature convergence - no sharing.

Produce a diverse population of near optimal solutions in different `niches'.

The problem of premature convergence has been addressed by a number of authors using a diversity of techniques. Many of the papers in [Davis 1987] contain discussions of precisely this point. The methods used to combat premature convergence in TGAs are not necessarily appropriate to the parallel formulations of genetic algorithms (PGAs) which we shall discuss shortly.

Cavicchio, in his doctoral dissertation, suggested a preselection mechanism as a means of promoting genotype diversity. Preselection filters the children generated, possibly picking the fittest, and replaces parent members of the population with their offspring [Cavicchio 1970].
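A minimal sketch of such a preselection step, assuming a maximisation problem (the function name and the exact replacement rule are illustrative, not Cavicchio's precise scheme):

```python
def preselect(population, fitness, parent_idx, child):
    """Preselection (a sketch): a child may enter the population only
    by replacing its own parent, and only if it is at least as fit.
    Because a child is genotypically similar to its parent, this
    promotes genotype diversity in the population as a whole."""
    if fitness(child) >= fitness(population[parent_idx]):
        population[parent_idx] = child
    return population

# e.g. with fitness = number of 1-bits, a child '0111' replaces its
# parent '0001', but a child '0001' would not replace '0111'.
```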
De Jong's crowding scheme is an elaboration of the preselection mechanism. In the crowding scheme, an offspring replaces the most similar string from a randomly drawn subpopulation having size CF (the crowding factor) of the current population. Thus a member of the population experiences a selection pressure in proportion to its similarity to other members of the population [De Jong 1975]. Empirical determination of CF with a five function test bed found CF = 3 to be optimal.

Booker implemented a sharing method in a classifier system environment which used the bucket brigade algorithm [Booker 1982]. The idea here was that if related rules share payments then sub-populations of rules will form naturally. However, it seems difficult to apply this mechanism to standard genetic algorithms. Schaffer has extended the idea of sub-populations in his VEGA model, in which each fitness element has its own sub-population [Schaffer 1984].

A different approach to help maintain genotype diversity was introduced by Mauldin via his uniqueness operator [Mauldin 1984]. The uniqueness operator helped to maintain diversity by incorporating a `censorship' operator in which the insertion of an offspring into the population is possible only if the offspring is genotypically different from all members of the population at a number of specified genotypical loci.

Results and methods related to the TSP.

We digress briefly to give a little more detailed background material on the TSP. The question is often asked: if one cannot exactly solve any very large TSP problem (except in special cases; at present `very large' means a problem involving more than a thousand cities), how can one know how accurate a solution produced by a probabilistic or heuristic algorithm actually is?
The best exact solution methods for the travelling salesman problem are capable of solving problems of several hundred cities [Grötschel 1991], but unfortunately excessive amounts of computer time are used in the process and, as N increases, any exact solution method rapidly becomes impractical. For large problems we therefore have no way of knowing the exact solution, but in order to gauge the solution quality of any algorithm we need a reasonably accurate estimate of the minimal tour length. This is usually provided in one of two ways.

For a uniform distribution of cities the classic work by Beardwood, Halton and Hammersley (BHH) [Beardwood 1959] obtains an asymptotic best possible upper bound for the minimum tour length for large N. Let {Xi}, 1 ≤ i < ∞, be independent random variables uniformly distributed over the unit square, and let LN denote the length of the shortest closed path which connects all the elements of {X1,...,XN}. In the case of the unit square they proved, for example, that there is a constant c > 0 such that, with probability 1,

    lim (N → ∞) LN / N^(1/2) = c        (1)

In general c depends on the geometry of the region considered. One can use the estimate provided by the BHH theorem in the following form: the expected length LN* of a minimal tour for an N-city problem, in which the cities are uniformly distributed in a square region of the Euclidean plane, is given by

    LN* ≈ c2 √(NR)        (2)

where R is the area of the square and the constant c2 (for historical reasons known as Stein's constant - [Stein 1977]) has recently been estimated by Johnson, McGeoch and Rothberg [Johnson 1996] as c2 ≈ 0.70805 ± 0.00007.

A second possibility is to use a problem specific estimate of the minimal tour length which gives a very accurate figure: the Held-Karp lower bound [Held 1970], [Held 1971].
Computing the Held-Karp lower bound is an iterative process involving the evaluation of Minimal Spanning Trees for N-1 cities of the TSP followed by Lagrangean relaxations, see [Valenzuela 1997].
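The BHH estimate (2) is straightforward to compute; a minimal Python sketch (the function name is ours, and the constant is the Johnson-McGeoch-Rothberg estimate quoted above):

```python
import math

# Stein's constant, as estimated by Johnson, McGeoch and Rothberg.
C2 = 0.70805

def bhh_estimate(n_cities, area):
    """Expected minimal tour length L* = c2 * sqrt(N * R) for N cities
    uniformly distributed over a square region of area R."""
    return C2 * math.sqrt(n_cities * area)

# e.g. for 1000 cities in the unit square the estimate is about 22.39.
```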
If one seeks approximate solutions then various algorithms based on simple rule based heuristics (e.g. nearest neighbour and greedy heuristics), or local search tour improvement heuristics (e.g. 2-Opt, 3-Opt and Lin-Kernighan), can produce good quality solutions much faster than exact methods. A combinatorial local search algorithm is built around a `combinatoric neighbourhood search' procedure which, given a tour, examines all tours which are closely related to it and finds a shorter `neighbouring' tour, if one exists. Algorithms of this type are discussed in [Papadimitriou 1982]. The definition of `closely related' varies with the details of the particular local search heuristic. The particularly successful combinatorial local search heuristic described by Lin and Kernighan [Lin 1973] defines `neighbours' of a tour to be those tours which can be obtained from it by doing a limited number of interchanges of tour edges with non-tour edges. The slickest local heuristic algorithms3, which on average tend to have complexity O(n^α) for α > 2, can produce solutions with approximately 1-2% excess for 1000 cities in a few minutes. However, for 10,000 cities the time escalates rapidly and one might expect that the solution quality also degrades, see [Gorges-Schleuter 1990], p 101.

An approximation scheme A is an algorithm which, given problem instance I and ε > 0, returns a solution of length A(I, ε) such that

    |A(I, ε) - Lmin(I)| / Lmin(I) ≤ ε        (3)

where Lmin(I) denotes the minimal tour length for instance I. Such an approximation scheme is called a fully polynomial time approximation scheme if its run time is bounded by a function that is polynomial in both the instance size and 1/ε. Unfortunately the following theorem holds, see for example [Lawler 1985], pp 165-166.

Theorem. If P ≠ NP then there can be no fully polynomial time approximation scheme for the TSP, even if instances are restricted to points in the plane under the Euclidean metric.
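To make the notion of a combinatoric neighbourhood search concrete, here is a minimal sketch of a single improving 2-Opt move (function names are ours; `dist` is any symmetric distance function over city labels):

```python
def two_opt_once(tour, dist):
    """One pass of 2-Opt neighbourhood search (a sketch): examine all
    tours obtained by removing two tour edges and reconnecting the two
    resulting paths (i.e. reversing a segment); return the first
    improving neighbour, or None if the tour is 2-Opt locally optimal."""
    n = len(tour)
    for i in range(n - 1):
        # When i == 0, skip j = n-1 so the two edges never share a city.
        for j in range(i + 2, n - (1 if i == 0 else 0)):
            a, b = tour[i], tour[i + 1]
            c, d = tour[j], tour[(j + 1) % n]
            # Gain from replacing edges (a,b),(c,d) by (a,c),(b,d).
            if dist(a, c) + dist(b, d) < dist(a, b) + dist(c, d):
                return tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:]
    return None
```

Repeating this until no improving neighbour exists yields a 2-Opt locally optimal tour.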
Although the possibility of a fully polynomial time approximation scheme is effectively ruled out, there remains the possibility of an approximation scheme which, although not polynomial in 1/ε, does have a running time which is polynomial in n for every fixed ε > 0. The Karp algorithms, based on cellular dissection, provide `probabilistic' approximation schemes for the geometric TSP.

Theorem [Karp 1977]. For every ε > 0 there is an algorithm A(ε) such that A(ε) runs in time C(ε)n + O(n log n) and, with probability 1, A(ε) produces a tour of length not more than (1 + ε) times the length of a minimal tour.

The Karp-Steele algorithms [Steele 1986] can in principle converge in probability to near optimal tours very rapidly. Cellular dissection is a form of divide and conquer. Karp's algorithms partition the region R into small subregions, each containing about t cities. An exact or heuristic method is then applied to each subproblem and the resulting sub-tours are finally patched together to yield a tour through all the cities.

Evolutionary Divide and Conquer.

Until recently the best genetic algorithms designed for TSP problems have used permutation crossovers, for example [Davis 1985], [Goldberg 1985], [Smith 1985], or edge recombination operators [Whitley 1989], and required massive computing power to gain very good approximate solutions (often actually optimal) to problems with a few hundred cities [Gorges-Schleuter 1990]. Gorges-Schleuter cleverly exploited the architecture of a transputer bank to define a topology on the population and introduce local mating schemes which enabled her to delay the onset of premature convergence. However, this improvement to the genetic algorithm is independent of

3 The most impressive results in this direction are due to David Johnson at AT&T Bell Laboratories - mostly reported in unpublished Workshop presentations.
any limitations inherent in permutation crossovers. Eventually, for problems of more than around 1000 cities, all such genetic algorithms tend to produce a flat graph of improvement against number of individuals tested, no matter how long they are run. Thus experience with genetic algorithms using permutation operators applied to the Geometric Travelling Salesman Problem (TSP) suggests that these algorithms fail in two respects when applied to very large problems: they scale rather poorly as the number of cities n increases, and the solution quality degrades rapidly as the problem size increases much above 1000 cities.

An interesting novel approach developed by Valenzuela and Jones [Valenzuela 1994], which seeks to circumvent these problems, is based on the idea of using the genetic algorithm to explore the space of problem subdivisions, rather than the space of solutions itself. This alternative method, for genetic algorithms applied to hard combinatoric search, can be described as Evolutionary Divide and Conquer (EDAC), and the approach has potential for any search problem in which knowledge of good solutions for subproblems can be exploited to improve the solution of the problem itself. As they say

! Essentially we are suggesting that intrinsic parallelism is no substitute for divide and conquer in hard combinatoric search and we aim to have both. [Valenzuela 1994]

The goal was to develop a genetic algorithm capable of producing reasonable quality solutions for problems of several thousand cities, and one which will scale well as the problem size n increases. `Scaling well' in this context almost inevitably means a time complexity of O(n) or at worst O(n log n). This is a fairly severe constraint: for example, given a list of n city co-ordinates, the simple act of computing all possible edge lengths, an O(n^2) operation, is excluded. Such an operation may be tolerable for n = 5000 but becomes intolerable for n = 100,000.
In the previous section we mentioned the Karp and Steele cellular dissection algorithms, and it is this technique which is the basis of the Valenzuela-Jones EDAC genetic algorithms for the TSP.

Figure 2-4 Solution to 50 City Problem using Karp's deterministic bisection method.
In practice a one-shot deterministic Karp algorithm yields rather poor solutions, typically 30% excess (with simple patching) when applied to 500 - 1000 city problems. Nevertheless, the Karp technique is a good starting point for exploring EDAC applied to the TSP, for several reasons. First, according to Karp's theorem there is some probabilistic asymptotic guarantee of solution quality as the problem size increases. Second, the time complexity is about as good as one can hope for, namely O(n log n). The run time of a genetic algorithm based on exploring the space of `Karp-like' solutions will be proportional to n log n multiplied by the number of times the Karp algorithm is run, i.e. the number of individuals tested.

Karp's algorithm proceeds by partitioning the problem recursively from the top down. At each step the current rectangle is bisected horizontally or vertically, according to a deterministic rule designed to keep the rectangle perimeter minimal. This bisection proceeds until each subrectangle contains a preset maximum number of cities t (typically t ≈ 10). Each small subproblem is then solved and the resulting subtours are patched together to produce a solution to the original problem - see Figure 2-4.

Figure 2-5 The EDAC (top) and simple 2-Opt (bottom) time complexity (log scales).

In the EDAC algorithm the genotype is a p × p binary array in which a `1' or `0' indicates whether to cut horizontally or vertically at the current bisection. If we maintain the subproblem size, t, and increase the number of cities in the TSP, then a partition better than Karp's becomes progressively harder to find by randomly choosing a horizontal or vertical bisection at each step. If the problem size is n ≈ 2^k t, where 2^k is the number of subsquares, then the corresponding genotype requires at least n/t - 1 bits.
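A minimal sketch of this genotype-driven dissection (illustrative only: we take '0' as a vertical cut and '1' as horizontal, and split at the median city rather than by Karp's minimal-perimeter rule):

```python
def bisect(cities, t, bits):
    """Recursively partition a list of (x, y) cities until each cell
    holds at most t cities. Each bisection consumes one genotype bit:
    '0' cuts vertically (split on x), '1' horizontally (split on y).
    Assumes enough bits are supplied. Returns the leaf cells, i.e.
    the small subproblems to be solved and patched together."""
    if len(cities) <= t:
        return [cities]
    axis = 0 if bits.pop(0) == "0" else 1
    cities = sorted(cities, key=lambda c: c[axis])
    mid = len(cities) // 2
    return bisect(cities[:mid], t, bits) + bisect(cities[mid:], t, bits)
```

Decoding consumes one bit per internal bisection, so roughly n/t - 1 bits for an n-city problem, consistent with the genotype size noted above.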
The size of the partition space is 2^(p^2), which for p = 80 (the value used for n = 5000) is approximately exp(4436). For n = 5000 the size of the permutation search space, roughly estimated using Stirling's formula, is around exp(37586). Thus searching partition space is easier than searching permutation space, and this provides a third argument in favour of exploring this representation of problem subdivision as a genotype. We know from Karp's theorem that the class of tours produced by dissection and patching will have representatives very close to the optimum tour, so by restricting attention to this smaller set one is not `throwing out the baby with the bath-water', i.e. the set may be smaller but it nevertheless contains near optimal tours.

This approach contrasts sharply with the idea of `broadcast languages' mooted in Chapter 8 of [Holland 1975], in which techniques for searching the space of representations for a genetic algorithm are discussed. In general the space of representations is vastly larger than the search space of the problem itself, but we have seen with the TSP that this space is already so huge that it is impractical to search in any comprehensive fashion for all except the smallest problems. Hence, it seems unlikely that replacing the original search space by an even larger one will turn out to be a productive approach.
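These two size estimates are easily reproduced (natural logarithms throughout; the permutation figure uses the leading-order Stirling estimate n ln n - n for ln n!):

```python
import math

p, n = 80, 5000  # the values quoted above

log_partition = p * p * math.log(2)    # ln of 2^(p^2)
log_permutation = n * math.log(n) - n  # ln of n! by Stirling's formula

print(round(log_partition), round(log_permutation))  # 4436 37586
```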
Figure 2-6 The 200 generation EDAC 5% excess solution for a 5000 city problem.

In any event even the EDAC algorithm requires clever recursive repair techniques to improve the accuracy when subtours are patched together. Nevertheless, the algorithm scales well. Figure 2-5 compares the EDAC algorithm with simple 2-Opt (which gives an accuracy of around 8% excess). This version of the EDAC algorithm produces solutions at the 5% level, see Figure 2-6, but a later more elaborate variant reliably produces solutions with around 1% excess and has been tested on problem sizes of up to 10,000 cities. This technique probably represents the best that can be done at the present time using genetic algorithms for the TSP. It is not yet practical by comparison with iterated Lin-Kernighan (or even 2-Opt)4, but it scales well and may eventually offer a viable technique for obtaining good solutions to TSP problems involving several hundred thousand cities.

Parallel EDACII and EDACIII were both tested on a range of problems between 500 and 5000 cities. Parental pairs were chosen from the initial random population and the mid-parent value of the tour lengths calculated and recorded. Crossover and mutation were then applied to each selected parental pair and the tour length evaluated for the resulting offspring. Pearson's correlation coefficient, rxy, was calculated in each experiment and significance tests based on Fisher's transformation carried out in order to establish whether the resulting correlation coefficients differed significantly from zero (i.e. no correlation). The scatter diagrams in Figure 2-7 and Figure 2-8 illustrate the Price correlation for parallel EDACII and EDACIII on the 5000 city problem.

Although the genotype used in these experiments was a binary array, it could more naturally (at the cost of complication in the coding) be represented by a pair of binary trees, or a quadtree.
The use of trees here would be more in keeping with the recursive construction of the phenotype from the genotype, a process analogous to growth, and it is possible to produce a modified Schema theorem for the case of trees, where the genetic information is encoded in the shape of the tree and information is placed at leaf nodes.

4 For example, wildly extrapolating the figures gives the break-even point with 2-Opt at around n = 422,800, requiring some 74 cpu days! Of course, other things would collapse before then.

Figure 2-7 EDACII mid-parent vs offspring correlation for 5000 cities [Valenzuela 1995].
Figure 2-8 EDACIII mid-parent vs offspring correlation for 5000 cities [Valenzuela 1995].

In Nature very complex phenotypical structures frequently give the appearance of having been constructed recursively from the genotype. Examples of recursive algorithms which lead to very natural looking graphical representations of living structures such as trees, plants, and so on, can be found in the work of Lindenmayer [Lindenmayer 1971] on what are now called L-systems. These production systems are very similar to the production rules which define various kinds of context sensitive or context free grammars.

The combination of tree structured genotypes, or recursive construction algorithms similar to production rules, with the divide-and-conquer paradigm suggests a powerful computational technique for the compression of complex phenotypical structures into useful genotypical structures. So much so that, as our understanding progresses of exactly how DNA encodes the phenotypical structure of individual biological organisms (particularly the neural systems of mammals), it would be surprising to find that Nature has not employed some such technique.

Chapter references

[Altenberg 1987] L. Altenberg and M. W. Feldman. Selection, generalised transmission, and the evolution of modifier genes. The reduction principle. Genetics 117:559-572.
[Altenberg 1994] L. Altenberg. The Evolution of Evolvability in Genetic Programming. Chapter 3 in Advances in Genetic Programming, Ed. Kenneth E.
Kinnear, Jr., MIT Press, 1994.
[Belew 1990] R. Belew, J. McInerney, and N. N. Schraudolph. Evolving networks: Using the genetic algorithm with connectionist learning. CSE Technical Report CS90-174, University of California, San Diego, 1990.
[Booker 1982] L. B. Booker. Intelligent behaviour as an adaption to the task environment. Doctoral dissertation, University of Michigan, 1982. Dissertation Abstracts International 43(2), 469B.
[Brandon 1990] R. N. Brandon. Adaptation and Environment, pages 83-84. Princeton University Press, 1990.
[Cavalli-Sforza 1976] L. L. Cavalli-Sforza and M. W. Feldman. Evolution of continuous variation: direct approach through joint distribution of genotypes and phenotypes. Proceedings of the National Academy of Sciences U.S.A., 73:1689-1692, 1976.
[Cavicchio 1970] D. J. Cavicchio. Adaptive search using simulated evolution. Doctoral dissertation, University of Michigan (unpublished), 1970.
[Chalmers 1990] David J. Chalmers. The Evolution of Learning: An experiment in Genetic Connectionism. Proceedings of the 1990 Connectionist Models Summer School, San Mateo, CA. Morgan Kaufmann, 1990.
[Collins 1991] R. J. Collins and D. R. Jefferson. Selection in massively parallel genetic algorithms. Proceedings of the Fourth International Conference on Genetic Algorithms. Morgan Kaufmann, 1991.
[Davis 1987] Lawrence Davis, Editor. Genetic Algorithms and Simulated Annealing. Pitman Publishing, London, 1987.
[De Jong 1975] K. De Jong. An analysis of the behaviour of a class of genetic adaptive systems. Doctoral dissertation, University of Michigan, 1975. Dissertation Abstracts International 36(10), 5140B.
[Freedman 1991] D. Freedman, R. Pisani, R. Purves and A. Adhikari. Statistics, Second edition. W. W. Norton, New York, 1991.
[Goldberg 1987] David E. Goldberg and Jon Richardson. Genetic Algorithms with Sharing for Multimodal Function Optimization. Proc. Second Int. Conf. on Genetic Algorithms, pp. 41-49, MIT.
[Gorges-Schleuter 1990] Martina Gorges-Schleuter. Genetic Algorithms and Population Structures: A Massively Parallel Algorithm. Ph.D. Thesis, Department of Computer Science, University of Dortmund, Germany, August 1990.
[Grefenstette 1987] John J. Grefenstette. Incorporating Problem Specific Knowledge into Genetic Algorithms. In Genetic Algorithms and Simulated Annealing, Ed. Lawrence Davis, Pitman Publishing, London.
[Holland 1975] John H. Holland. Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan Press.
[Horowitz 1978] E. Horowitz and S. Sahni. Fundamentals of Computer Algorithms. London, Pitman Publishing Ltd.
[Johnson 1996] D. S. Johnson, L. A. McGeoch and E. E. Rothberg. Asymptotic experimental analysis for the Held-Karp traveling salesman bound. Proceedings 1996 ACM-SIAM Symposium on Discrete Algorithms, to appear.
[Jones 1993] Antonia J. Jones. Genetic Algorithms and their Applications to the Design of Neural Networks. Neural Computing & Applications, 1(1):32-45, 1993.
[Koza 1992] John R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. Bradford Books, MIT Press, 1992. ISBN 0-262-11170-5.
[Lindenmayer 1971] A. Lindenmayer. Developmental systems without cellular interaction, their languages and grammars. J. Theoretical Biology 30, 455-484, 1971.
[Lyubich 1992] Y. I. Lyubich. Mathematical Structures in Population Genetics. Springer-Verlag, New York, pages 291-306, 1992.
[Manderick 1989] B. Manderick and P. Spiessens. Fine grained parallel genetic algorithms. Proceedings of the Third International Conference on Genetic Algorithms. Morgan Kaufmann, 1989.
[Manderick 1991] B. Manderick, M. de Weger, and P. Spiessens. The genetic algorithm and the structure of the fitness landscape. In R. K. Belew and L. B. Booker, Editors, Proceedings of the Fourth International Conference on Genetic Algorithms, pages 143-150, San Mateo, CA, Morgan Kaufmann.
[Mauldin 1984] M. L. Mauldin. Maintaining diversity in genetic search. National Conference on Artificial Intelligence, 247-250, 1984.
[Macfarlane 1993] D. Macfarlane and Antonia J. Jones. Comparing networks with differing neural-node functions using Transputer based genetic algorithms. Neural Computing & Applications, 1(4):256-267, 1993.
[Menczer 1992] F. Menczer and D. Parisi. Evidence of hyperplanes in the genetic learning of neural networks. Biological Cybernetics 66(3):283-289.
[Miller 1989] G. Miller, P. Todd, and S. Hegde. Designing neural networks using genetic algorithms. In Proceedings of the Third Conference on Genetic Algorithms and their Applications, San Mateo, CA. Morgan Kaufmann, 1989.
[Muhlenbein 1988] H. Muhlenbein, M. Gorges-Schleuter, and O. Kramer. Evolution Algorithms in Combinatorial Optimisation. Parallel Computing, 7, pp. 65-85.
[Price 1970] G. R. Price. Selection and covariance. Nature, 227:520-521.
[Price 1972] G. R. Price. Extension of covariance selection mathematics. Annals of Human Genetics 35:485-489.
[Salmon 1971] W. C. Salmon. Statistical Explanation and Statistical Relevance. University of Pittsburgh Press, Pittsburgh, 1971.
[Schaffer 1984] J. D. Schaffer. Some Experiments in Machine Learning Using Vector Evaluated Genetic Algorithms. Ph.D. Thesis, Department of Electrical Engineering, Vanderbilt University, December 1984.
[Schwefel 1965] H-P. Schwefel. Kybernetische Evolution als Strategie der experimentellen Forschung in der Strömungstechnik. Diploma thesis, Technical University of Berlin, 1965.
[Slatkin 1970] M. Slatkin. Selection and polygenic characters. Proceedings of the National Academy of Sciences U.S.A. 66:87-93, 1970.
[Smith 1980] S. F. Smith. A Learning System Based on Genetic Adaptive Algorithms. Ph.D. Dissertation, University of Pittsburgh, 1980.
[Spiessens 1991] P. Spiessens and B. Manderick. A massively parallel genetic algorithm: implementation and first analysis. Proceedings of the Fourth International Conference on Genetic Algorithms. Morgan Kaufmann, 1991.
[Valenzuela 1994] Christine L. Valenzuela and Antonia J. Jones. Evolutionary Divide and Conquer (I): A novel genetic approach to the TSP. Evolutionary Computation 1(4):313-333, 1994.
[Valenzuela 1995] Christine L. Valenzuela. Evolutionary Divide and Conquer: A novel genetic approach to the TSP. Ph.D. Thesis, Department of Computing, Imperial College, London, 1995.
[Valenzuela 1997] Christine L. Valenzuela and Antonia J. Jones. Estimating the Held-Karp lower bound for the geometric TSP. To appear: European Journal of Operational Research, 1997.
[Whitley 1990] D. Whitley, T. Starkweather, and C. Bogart. Genetic algorithms and neural networks: Optimizing connections and connectivity. Parallel Computing, forthcoming.
[Wilson 1990] Perceptron redux. Physica D, forthcoming.
III Hopfield networks.

Introduction.

As far back as 1954 Cragg and Temperley [Cragg 1954] had introduced the spin-neuron analogy using a ferromagnetic model. They remarked "It remains to be considered whether biological analogues can be found for the concepts of temperature and interaction energy in a physical system." In 1974 Little [Little 1974] introduced the temperature-noise analogy, but it was not until 1982 that John Hopfield [Hopfield 1982], a physicist, made significant progress in the direction requested by Cragg and Temperley. In a single short paragraph, he suggests one of the most important new techniques to have been proposed in neural networks.

Hopfield nets and energy.

The standard approach to a neural network is to propose a learning rule, usually based on synaptic modification, and then to show that a number of interesting effects arise from it. Hopfield starts by saying that: "The function of the nervous system is to develop a number of locally stable states in state space." Other points in state space flow into the stable points, called attractors. In some other dynamic systems the behaviour is much more complex; for example, the system may orbit two or more points in state space in a non-periodic way, see [Abraham 1985]. However, this turns out not to be the case for the Hopfield net.

The flow of the system towards a stable point allows a mechanism for correcting errors, since deviations from the stable points disappear. The system can thus reconstruct missing information, since the stable point will appropriately complete missing parts of an incomplete initial state vector.

Each of n neurons has two states, like those of McCulloch and Pitts: xi = 0 (not firing) and xi = 1 (firing at maximum rate). When neuron i has a connection made to it from neuron j, the strength of the connection is defined as wij. Non-connected neurons have wij = 0.
The instantaneous state of the system is specified by listing the n values of xi, so it is represented by a binary word of n bits. The state changes in time according to the following algorithm. For each neuron i there is a fixed threshold θi. Each neuron readjusts its state randomly in time, but with a mean attempt rate µ, setting

    xi(t) = 1          if Σ(j≠i) wij xj(t-1) > θi
    xi(t) = xi(t-1)    if Σ(j≠i) wij xj(t-1) = θi        (1)
    xi(t) = 0          if Σ(j≠i) wij xj(t-1) < θi

Thus, an element chosen at random asynchronously evaluates whether it is above or below threshold and readjusts accordingly.

Although this model has superficial similarities to the Perceptron, there are essential differences. Firstly, Perceptrons were modelled chiefly with the neural connections in a `forward' direction, and the analysis of such networks with backward coupling proved intractable. All the interesting results of the Hopfield model arise as a consequence of the strong backward coupling. Secondly, studies of perceptrons usually made a random net of